The Magic of Distributed Access Control

Aug 31, 2024

Why are distributed systems so hard to design? Well, one challenge is getting them to make decisions. In a centralized model, you can write up a set of rules, hand those rules to a trusted authority, and just let it start arbitrating decisions for you. In a distributed model, you have to weave those rules into the very fabric of the system itself.

Access control is a very common need for all kinds of applications. To be practical, many applications will not only want to restrict certain actions to people who have permission, but they'll also want some way for people to delegate their permissions granularly. In a centralized system, implementing access control is pretty easy: All actions go through a central authority, which at any given time knows the full state of the system. Whenever someone tries to take an action, the authority checks whether the current state allows it, and can immediately decide whether to accept it or reject it. Only if it's accepted does it change the system's state.

In a distributed system, access control gets more complicated. Some systems use network-wide consensus so that they can come to a more or less "final" conclusion about whether an event actually changes the state, and what order it takes with respect to other events. This is how you get blockchains. It works, I guess, but you have to trade off some amount of decentralization or scalability in the process.

Multiple alternative events happen; A tries to give $20 to B in one branch, and $20 to C in another branch, but consensus selects only one of the events, creating a total order. A gives$20 to B A gives$20 to C A gives$20 to B A gives$20 to C Multiple alternative events happen Consensus selects only one,creating a total order

Personally, I'm much more interested in exploring systems without consensus, in which nothing is final and the state is eventually consistent. This is the design space in which we find collaborative, local-first documents.

Peers A and B each make a change to their view of the document, without coordinating with each other; A adds garlic to the shopping list, while B crosses butter off the shopping list. The next time they compare notes, they update their event logs to include both events, creating a partial order. A adds garlic tothe shopping list B crosses butter offthe shopping list A adds garlic tothe shopping list The next time they compare notes, theyupdate their event logs to include both events,creating a partial order Peers A and B each make a change to their viewof the document, without coordinating with each other B crosses butter offthe shopping list

But how can you possibly authorize anything when the system doesn't even guarantee a consistent view of who has what permissions? Whatever the answer, it's going to require a healthy dose of optimism and a touch of magic.

Now, Matrix has already explored this same problem space. Matrix is organized into rooms (chat rooms), with the main data structure of interest here being room state: a flat, replicated key-value store which holds not just "cosmetic" data like the room's name and topic, but also information governing the authorization of changes to the room state itself. The protocol then defines a conflict resolution algorithm which deterministically merges divergent room states together. The specifics of this algorithm are critical to Matrix's eventual consistency and access control guarantees.

There are a few things I find unsatisfying about its design, though. In particular:

My goal here is to show that a simple, principled, and generic approach to access control for local-first applications is within reach. Let's get going!

Take One

For a spell to be effective, you must first give it direction. What makes a good access control system? What does it mean for something to be authorized, when the very state that authorized it might yet shift out from under you?

Let's first review what it means for a centralized access control system to be correct. I think this can be summed up neatly into a single rule:

Principle of Authorization
An event e sent at time t changes the system's state only if the state at t authorizes e.

If we can just get a distributed system to satisfy this same principle, then we should be good, right? Well, there are two major obstacles to doing so.

The first is that this principle assumes the existence of an authoritative clock. In a distributed system there is no such thing—each peer has their own clock, and events can happen concurrently. We can overcome this by requiring the Principle of Authorization to hold for any arbitrary (causally consistent) clock. In theory, this ensures that authorization decisions are fair to all peers and nobody can manipulate the sequence of events to their advantage.

The second challenge is that, since no part of the state is ever final, it's impossible to know whether the state you have for time t is really the full state for time t. However, if we can change our perspective a little, this isn't such a big problem. Whenever you use an eventually consistent system, the best the system can do is be optimistic that it has complete information. So we need to apply the same optimism to authorization: when a peer decides whether an event is authorized, they must base this on the assumption that they already know the full state of the system. If more information arrives later, they must be prepared to alter past authorization decisions—they can never be final!

In other words, as long as peers consider all authorization decisions tentative and ensure that they are consistent with any possible clock, we should expect to get a valid access control system.

Now it's time to give some force to our magic.

Let there be events. Linked to one another by sheer causality. Always growing downward, never cyclic.

Events link to the events that happened directly before them (their parents), forming a directed acyclic graph. An event,sent by a peer Events link to the events that happeneddirectly before them (their parents),forming a directed acyclic graph

Alone, they each hold an effect. Together, they give life to a document, forking and joining in a semilattice dance.

An event's effect describes the change it makes to the document. Events build a document starting from some initial state. In these examples, we'll look at events that draw and erase things on a virtual whiteboard. A draws An event's effect describes thechange it makes to the document A draws A draws B draws B erases Events build a documentstarting from some initial state

Many events are good. But some of them dare to overstep the rules. Tread carefully, we mustn't allow their effects to set in.

A draws a star, and then marks the star as 'do not erase'. This is authorized by the previous document state. B tries to erase the star, but the document says it can't be erased, so it is unauthorized. Unauthorized events have no effect. A draws A marks as'do not erase' B erases Authorized by theprevious document state Unauthorized; document saysthat can't be erased Unauthorized events have no effect

Somewhere a branch breaks free, carrying events with independent destinies. They threaten to circumvent others that would deauthorize them—but you have a plan. In your mind, an image forms: one event happening before the other, the possibilities collapsing to form a single strand. Summon all possible extensions, search them for answers. Do any of them block these events?

A draws a star, then marks it as 'do not erase', and B tries (but fails) to erase it. Then in another branch, C tries to erase the star concurrently with the event that prevents it from being erased. Is it authorized? There are 3 possible total orders: C's event might have happened before the star was marked as 'do not erase', which would be fine, but it also might have happened right after that, or even after the event of B trying to erase the star. So, it is unauthorized. A draws A marks as'do not erase' B erases C erases A draws A marks as'do not erase' B erases C erases A draws A marks as'do not erase' B erases C erases A draws A marks as'do not erase' B erases C erases C tries to erase concurrently with theevent that prevents it from being erased Is is authorized?There are 3 possible total orders: C's event might have actually happened after the 'do not erase'event, so it is unauthorized

There—you catch the anomaly, return everything to its place. Fuse the branches into a single definitive state.

The state for the main branch has the star marked as 'do not erase'. The state for C's branch just has the star, unmarked. Using the CRDT's merge operation to return to a consistent state gets us to a state that does have the star marked as 'do not erase'. A draws A marks as'do not erase' B erases C erases Use the CRDT's merge operation toreturn to a consistent state

So, we can certainly fulfill the Principle of Authorization by checking every possible total ordering of the event graph. If we think of time as a partial order, we've essentially implemented this stronger guarantee:

Continuous Authorization
An event e sent at time t changes the system's state only if the state at every t' which is equal to or concurrent with t authorizes e.

But is this efficient? Not at all. The number of ways to order n independent events grows factorially. Upholding access control guarantees with respect to real time is a noble goal, but sadly it's prohibitively expensive.

Hmm, what if we changed the rules? If we could somehow check each event against only a single state, authorization checks could be much more efficient. Check this out:

Eventual Authorization
An event e sent at time t changes the system's state only if the state at the latest t' which is equal to or concurrent with t authorizes e.

This is weaker than Continuous Authorization in a significant way: it allows an event that would normally have been unauthorized to later become authorized once the state has changed. Intuitively, this should be okay; if someone tries to send an event for which they lack permission, but then they later receive permission, they could just as well have sent the same event upon receiving permission.

Take Two

To enter this realm, we will need a protective charm. It's a simple one, wielded by the likes of Zermelo and Kripke, useful for guarding against any form of self-reference. This time, it reads:

No Cyclic Dependencies
Whether an event e is authorized must not depend on whether e actually changes the system's state.

You cross the threshold. On the surface, this world is very similar to the last one, but you know they work by different rules. Still, you can see that events in linear sequence behave familiarly, even here.

A draws a star, then they draw a trail, and B finally erases the trail. The events are checked against the previous document state, and all of them are authorized. A draws B erases Events are checkedagainst the previousdocument state A draws

You blink, and new events appear. Perhaps they were there all along? One of them blocks its sibling's effects, the other allows them. Which one do you believe?

In a new branch off of the 'A draws a trail' event, A marks the trail as 'do not erase' and then eventually allows the trail to be erased again. A draws B erases A draws A marks as'do not erase' A allows to be erased

The last world would have stopped the event in its tracks, but this one is much more forgiving. Carve a seam into the figure: below it, the event. Above, everything that might have happened before it. This is the state that will judge the event.

A line squiggles between the two branches, leaving B's erasing event below it, and all the other events above it. A draws B erases A draws A marks as'do not erase' A allows to be erased The state given by the events abovethe line determines whether the eventis authorized

But what is that state? It depends on whether the other events are authorized. Carve two more seams, each like the last. The states of these branches form a self-referential spiral. Look for shortcuts, simplifications. These are how you will escape.

Two more views of the event graph with different dividing lines drawn; in one, A's drawing events and B's erasing event are above the line, while A's 'do not erase' event lies below it, and in the other, A's decision to allow the trail to be erased again lies below it, while everything else lies above it. A draws B erases A draws A marks as'do not erase' A allows to be erased A draws B erases A draws A marks as'do not erase' A allows to be erased

Now you are ready. Dive into the spiral and seek the fixed point.

The initial state is that A is an admin. Start with the 'do not erase' event; it is authorized if and only if the state above its line judges A to be an admin, which none of these events could possibly change. Same for the event in which A allows the trail to be erased again; A was initially an admin and is still clearly an admin after the previous events, regardless of whether they're authorized. So, both of these are authorized. For the final event, it is authorized if and only if the state above judges the trail to be eraseable, which, having confirmed that the last event is authorized, we now see that it does. B successfully erases the trail, leaving only the star. A draws B erases A draws A marks as'do not erase' A allows to be erased A draws B erases A draws A marks as'do not erase' A allows to be erased A draws B erases A draws A marks as'do not erase' A allows to be erased Start with the top right event; it isauthorized if and only if the stateabove judges A to be an admin,which none of the events picturedhere could possibly change (Initial state: A is an admin) Same for the bottom right event;A was initially an admin and isstill clearly an admin after theabove events, regardless ofwhether they're authorized Final event; authorized if and only ifthe state above judges to beeraseable, which, having confirmedthat the bottom right event isauthorized, we now see that it does

So this is another possible access control system. To summarize, it gives us these semantics:

An event e changes the state if and only if the state at {e' | ¬(e ≤ e')} authorizes it.

It should be easy enough to see that any access control system implementing this rule will satisfy Eventual Authorization. The state at {e' | ¬(e ≤ e')} reflects all events that happened before e or concurrently with e, which by definition is the latest such available state. The main difficulty, since the authorization checks for concurrent events are mutually recursive, lies in how you reach a fixed point. Lazy data structures are always a good option for computing fixed points.

Demo: Do Not Erase

As a proof of concept, I've created a whiteboard application demonstrating some basic access control rules. The document state is simple enough that I was able to model it as a handful of conflict-free replicated sets, which I treat as a lazy data structure in order to perform authorization queries. Check out the video for details:

The code is available at rad:z2pFBc3GY68cDyCtWx8CsSWDXS7X.

Caveats

The semantics of Continuous Authorization and Eventual Authorization are really sensible, I think, as long as we assume that their notion of 'time' reflects the order in which events actually happen in the real world. But famously, time in distributed systems is more complicated. For one thing, networks are slow, devices go offline, and so the partial order of events given by the system can always end up being weaker than actual causality. There's not much you can do about this; peers can even be dishonest and pretend not to know about events that they have actually received. In a system satisfying Continuous/Eventual Authorization, this means that there are any number of ways to force peers to consider the possibility that an event happened earlier than it actually did.

Let's examine the impact of this problem. With Eventual Authorization, if you are dishonest and send an event that doesn't actually reference the most recent state, peers will still authorize it against the same state either way. (Good, so you still can't circumvent events that restrict your power.)

But it does change the state that other, concurrent events will be authorized against. If you have power, you can abuse this to cause the authorization decisions of earlier events to be changed—even if they happened before you got your power!

A dishonest peer sends an erase event pretending that it hasn't seen the 'do not erase' event from the parallel branch; this fails. A dishonest admin sends a 'do not erase' event pretending it hasn't seen the erase event from the parallel branch; this succeeds and retroactively deauthorizes the erase event. A draws B erases A draws A marks as'do not erase' Dishonest peer sends an eventpretending that it hasn't seen the'do not erase' event; this fails Dishonest admin sends an eventpretending that it hasn't seen theerase event; this succeeds andretroactively deauthorizes theerase event A draws B erases A draws A marks as'do not erase'

Is this really okay? Well, consider that whenever you delegate power to someone, you're trusting them to use that power responsibly. It shouldn't be a stretch to consider this trust as an indicator of their honesty as well. Notably, Matrix makes a similar assumption: it resolves conflicts in favor of the user with the higher power level, on the grounds that they are more likely to be honest. I think this is a reasonable assumption to have as part of the threat model. The most important job of an access control system is to stop people who will never have power from performing restricted actions, and that's not what's at stake here.

One notable consequence of the ability of dishonest peers to roll back their clocks is that you should avoid designing systems which have mutually exclusive events. If, for example, your system allows admins to strip each other of their power, then it will always be possible for admins to enter a battle of "no, I demoted you first—no, you" and evade having their power taken away.

A revokes B's admin rights. B sees this and retaliates by pretending to have revoked A's admin rights at the same time, starting a power struggle. A deops B A revokes B's admin rights A deops B B sees this and retaliates bypretending to have revoked A'sadmin rights at the same time,starting a power struggle B deops A

This constraint is probably a good thing: in centralized systems that allow admins to demote each other, any admin that has their account compromised can initiate a takeover of the system, locking all other admins out of it forever. A better design might require admins to voluntarily demote themselves (as Matrix does), or else retain information about who delegated their power to who, so that people can only be demoted by someone higher up in the power hierarchy.

Other than that, the primary challenge that remains is to make this technique fast. The Do Not Erase demo is completely unoptimized, because I wanted it to read as a direct translation of the Eventual Authorization semantics into runnable code. In particular, it traverses the entire event graph many times over while performing authorization checks! Instead of implementing this in terms of lazy data structures and recursion, a better bet might be to iteratively revise authorization decisions until you reach a fixed point. In general, there seem to be lots of opportunities to reuse previous authorization results and focus only on the parts of the event graph that change, so I feel optimistic.

Closing thoughts

One thing that inhibits the development of new distributed systems is the lack of reusable components. While I think I've managed to distill a reusable technique for access control here, it would be really interesting to see if this can be turned into a reusable library or perhaps even a standard. A dream of mine is to have a standard for distributed documents that define their own presentation logic, business logic, and access control rules. If such a thing is possible, this could provide a basis for collaborative experiences that's as malleable as the open Web, yet as personal as a file stored on your own device—local-first software, but as a reusable platform.

That's about all! Note that I am not a security professional, so you should definitely take my analyses and judgements here with a grain of salt. But I really hope that the idea I've presented here can be of use to future systems needing access control :) In particular, the flexibility and semantic transparency of this approach, as compared to the status quo of Matrix state resolution, appears to be a promising direction for improvement.

References