Hi,
A project I’m involved in as an architect is considering using OPA, the Open Policy Agent, for Authorization (AuthZ) purposes. With permission to ask a broader audience for advice, below is a summary of my research on using OPA / Rego for AuthZ.
The ideal outcome of this blog post would be a few quality conversations about OPA and AuthZ with people who have used and/or developed OPA and the Rego engine. I would much appreciate it if you could forward this post to the relevant people in your network.
Problem Statement
The use case assumes high-performance low-latency AuthZ policy application.
In our use case, the data related to making AuthZ decisions, as well as the policy rules that govern these decisions, are subject to change at run time, with no AuthZ service downtime.
High performance can generally be achieved by sharding. Latency is harder: since parts of our code are synchronous, it is imperative that most policy application decisions are evaluated with sub-millisecond latency.
As part of the functional requirements, the ultimate AuthZ design should ensure that policy decisions lag the underlying data and rules by no more than ~five seconds.
In practice this means that:
1. Ideally, each AuthZ decision is evaluated on the fly, with <0.1ms latency.
2. The second-best option is to keep a cache of decisions for the active users.
2.1. In which case this cache should be behind by at most ~five seconds.
2.2. Both when the data changes and when the policy rules change.
The difficulty in the former case, evaluating AuthZ decisions on the fly, comes from:
Having to keep all the relevant data close to the policy application point, and
Tight latency constraints, which eliminate any and all room for external calls.
The difficult part in the latter case is that the cache of the AuthZ decisions should be kept up to date.
Keeping per-{user/identity} and per-resource data cached, and quickly re-evaluating the AuthZ policies as this data changes, is challenging yet doable.
The impossible part is to update the cached decisions when policy rules change.
A change in per-identity or per-resource data affects a limited number of {identity,resource}-related permissions, which can be easily enumerated.
But a change in policy rules may potentially affect the policy decisions of a large fraction of the currently active (and thus cached) users/identities.
The above concludes the initial problem statement.
On Throughput and Latency
An important addendum on the ~five seconds (cache) staleness policy.
In theory, the ~five seconds constraint on (2.2) above could be loosened at the product level, by agreeing that a large-scale AuthZ policy change can take up to, say, ~a minute to take effect for all (currently active) users.
In practice, IMHO:
It looks suspicious that such a problem would even exist,
When the data required to make policy decisions fits in ~single-digit GBs,
And the logic to evaluate a rule fits in ~hundreds of CPU instructions,
Which translates to ~millions of policy applications per second,
With sub-microsecond (!) latency.
However, reaching these numbers requires more custom/manual work, and making the decision on the tradeoff between product requirements and the complexity of the ultimate solution is beyond the scope of this research.
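To give a flavor of what that custom work could look like, here is a minimal sketch (all names are hypothetical) where permissions are packed into 64-bit masks, so that a single policy check boils down to a lookup, an AND, and a comparison:

```go
package main

import "fmt"

// Hypothetical permission bits, packed into a single 64-bit mask per
// {identity, resource class} pair.
type Permission uint64

const (
	PermRead Permission = 1 << iota
	PermWrite
	PermDelete
)

// perms maps an identity to its permission mask; in a real system this would
// be a compact, carefully laid-out in-RAM structure, updated as data changes.
var perms = map[string]Permission{
	"alice": PermRead | PermWrite,
	"bob":   PermRead,
}

// allowed is the entire "policy application": one lookup, one AND, one compare.
func allowed(user string, p Permission) bool {
	return perms[user]&p != 0
}

func main() {
	fmt.Println(allowed("alice", PermWrite)) // true
	fmt.Println(allowed("bob", PermDelete))  // false
}
```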
Instead, my goal here is to perform a fair evaluation of what the OPA/Rego-based design can and can not deliver.
OPA Integration Options
Below are four possible ways of using OPA that we can think of.
Option 1: Keep all the data in OPA
Intuitively this is the best option:
100% of the data required to make AuthZ decisions is kept in OPA.
The application server is responsible for keeping this data up to date.
Policy application can be performed ~millions of times per second.
Synchronously, and OPA-compiled queries are run 100% locally.
Using OPA as a library allows calling its methods natively, with zero overhead,
provided the language in which the application server is developed is Go (see the sketch below).
OPA supports pushing data into it with no downtime, by design.
As well as updating this data in real time.
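For reference, a minimal sketch of this setup using the OPA Go SDK with its in-memory store (the policy, data, and names below are made up; exact import paths and Rego syntax may differ across OPA versions):

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/open-policy-agent/opa/rego"
	"github.com/open-policy-agent/opa/storage/inmem"
)

// Classic (pre-1.0) Rego syntax; newer OPA versions may require `if`.
const policy = `
package authz

default allow = false

allow {
	some role
	data.bindings[input.user][role]
	data.permissions[role][input.action]
}
`

func main() {
	ctx := context.Background()

	// 100% of the data lives inside OPA's in-memory store; the application
	// server would be responsible for keeping it up to date.
	store := inmem.NewFromObject(map[string]interface{}{
		"bindings":    map[string]interface{}{"alice": map[string]interface{}{"admin": true}},
		"permissions": map[string]interface{}{"admin": map[string]interface{}{"delete": true}},
	})

	query, err := rego.New(
		rego.Query("data.authz.allow"),
		rego.Module("authz.rego", policy),
		rego.Store(store),
	).PrepareForEval(ctx)
	if err != nil {
		log.Fatal(err)
	}

	// The prepared query runs fully in-process, synchronously.
	rs, err := query.Eval(ctx, rego.EvalInput(map[string]interface{}{
		"user":   "alice",
		"action": "delete",
	}))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("allow:", rs.Allowed())
}
```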
In practice, this approach hits the wall right away due to the extremely bloated memory footprint of the way OPA stores the data.
Populating the OPA internal data structures permanently occupies ~25x .. ~30x more RAM than the JSON input that was pushed into OPA. This factor is cited in the OPA documentation; I have confirmed it myself as part of this research.
And this ~25x .. ~30x factor does not include the bloat introduced by the JSON encoding itself. I.e., if the ultimate comparison is made the way it should be, [“lists”, “of”, “JSON”, “strings”] versus int64 bit masks, the true bloat factor can easily be in the thousands, rendering the solution infeasible.
In addition to the problems above, my experiments show that pushing data into OPA is slow: it literally takes minutes for OPA to digest a gigabyte-large JSON blob. Of course, in practice, incremental updates would be smaller and faster, but the time of the first, cold, start is an important metric to be considered as well.
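For anyone who wants a ballpark figure of their own, here is a crude sketch of the kind of measurement involved (the file name is hypothetical, and comparing heap growth against raw JSON size is only an approximation):

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os"
	"runtime"

	"github.com/open-policy-agent/opa/storage/inmem"
)

// heapAlloc forces a GC and returns the current heap size.
func heapAlloc() uint64 {
	runtime.GC()
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.HeapAlloc
}

func main() {
	raw, err := os.ReadFile("data.json") // hypothetical JSON blob
	if err != nil {
		log.Fatal(err)
	}

	before := heapAlloc()

	// Parse the JSON and hand it to OPA's in-memory store, the same way the
	// data would be pushed into an embedded OPA instance.
	var doc map[string]interface{}
	if err := json.Unmarshal(raw, &doc); err != nil {
		log.Fatal(err)
	}
	store := inmem.NewFromObject(doc)

	after := heapAlloc()
	runtime.KeepAlive(store) // make sure the store survives the GC above

	fmt.Printf("raw JSON: %d bytes, extra heap: %d bytes (~%.1fx)\n",
		len(raw), after-before, float64(after-before)/float64(len(raw)))
}
```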
Option 2: Augment each request to OPA with relevant data
The second option is to populate each request to OPA with the relevant bits of data.
It is the next best option, since OPA cannot handle the volume of data to be stored without unnecessary bloat: keep the data in the application server, identify the relevant fields prior to making an OPA request, augment the request with these fields, and call OPA.
Getting a bit ahead of myself, this may well be the best option.
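A minimal sketch of what this could look like, with hypothetical in-process lookup tables standing in for the real data store (the policy and all names below are made up):

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/open-policy-agent/opa/rego"
)

// Hypothetical in-process data; in reality this would be a compact,
// purpose-built store kept up to date by the application server.
var (
	rolesByUser   = map[string][]string{"alice": {"admin"}}
	actionsByRole = map[string][]string{"admin": {"read", "write", "delete"}}
)

// The policy only sees whatever the caller put into `input`; OPA stores no data.
// (Classic Rego syntax; newer OPA versions may require `if`.)
const policy = `
package authz

default allow = false

allow {
	role := input.roles[_]
	input.role_actions[role][_] == input.action
}
`

func main() {
	ctx := context.Background()
	query, err := rego.New(
		rego.Query("data.authz.allow"),
		rego.Module("authz.rego", policy),
	).PrepareForEval(ctx)
	if err != nil {
		log.Fatal(err)
	}

	// Identify the fields relevant to this particular request and ship only
	// those to OPA as part of the input document.
	user, action := "alice", "delete"
	roles := rolesByUser[user]
	roleActions := map[string][]string{}
	for _, r := range roles {
		roleActions[r] = actionsByRole[r]
	}

	rs, err := query.Eval(ctx, rego.EvalInput(map[string]interface{}{
		"action":       action,
		"roles":        roles,
		"role_actions": roleActions,
	}))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("allow:", rs.Allowed())
}
```

Note how the field names and shapes of the input document become a de facto protocol between the Go code and the Rego code, which is exactly the coupling described next.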
The major, if not the only, problem with this approach is that the business logic of AuthZ policy application is now spread across two locations (and two languages):
The application server that makes the very OPA calls.
Which now has to also act as a compact and effective custom data store.
And the OPA policy script (in Rego).
That expects the request to have all the needed data, in the right format.
In practice this approach may fly, but the downside is that making changes that affect the protocol is now harder, as two pieces of code, in two different programming languages, would have to be synchronized.
The second, minor, problem with this solution is the performance overhead: creating and parsing the JSONs, as well as making the call to the OPA engine itself (unless the caller is implemented in Go as well).
Option 3: Have OPA talk to a separate nearby service
I.e., have OPA pull the data from an external service, asynchronously or per request, instead of having this data pushed into OPA.
When we started looking into OPA, we kept in mind that an OPA rule can use data stored outside OPA. Our assumption was along the lines of “we could put a Redis cache next to the OPA engine” or similar.
Turns out this is not quite true. OPA can, in theory, talk to a Redis instance nearby. But the only way for OPA to talk to an external service is by making HTTP requests, even if only to localhost.
This approach poses two problems, a minor one and a major one.
A minor one is, again, the performance overhead, mostly in latency. Keep in mind quite a few AuthZ use cases would be synchronous in nature.
The amount of data massaging to generate and handle a single HTTP JSON request and a boolean (or an array of booleans) HTTP response is, again, likely orders of magnitude larger than what it takes to actually apply the very policy.
The major problem, however, which overshadows the minor one, is that the OPA policy language, Rego, does not appear to make the necessary outgoing calls concurrently.
Maybe there is an OPA/Rego option that I missed, but my experiments definitively show that if the policy is { A and B and C and D }, and all of A, B, C, D require an external HTTP call, then the OPA engine would only make the B call after the A call has completed.
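For illustration, a policy of that shape could look roughly like this (the endpoints and names are made up; the policy is embedded as a Go string to match the other sketches):

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/open-policy-agent/opa/rego"
)

// Four conjuncts, each requiring an external (localhost) HTTP call via
// http.send. In my experiments, the call for B is only issued once the call
// for A has returned, and so on; the calls are not made in parallel.
// (Classic Rego syntax; newer OPA versions may require `if`.)
const policy = `
package authz

default allow = false

allow {
	a := http.send({"method": "GET", "url": "http://localhost:8081/check/a"})
	a.status_code == 200
	b := http.send({"method": "GET", "url": "http://localhost:8081/check/b"})
	b.status_code == 200
	c := http.send({"method": "GET", "url": "http://localhost:8081/check/c"})
	c.status_code == 200
	d := http.send({"method": "GET", "url": "http://localhost:8081/check/d"})
	d.status_code == 200
}
`

func main() {
	ctx := context.Background()
	query, err := rego.New(
		rego.Query("data.authz.allow"),
		rego.Module("authz.rego", policy),
	).PrepareForEval(ctx)
	if err != nil {
		log.Fatal(err)
	}
	rs, err := query.Eval(ctx)
	if err != nil {
		log.Fatal(err)
	}
	// End-to-end latency is roughly the sum of four round-trips,
	// not one round-trip.
	fmt.Println("allow:", rs.Allowed())
}
```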
In other words, the OPA engine appears to take the optimistic approach of not making extra outgoing calls, “hoping” that fewer calls would be necessary to evaluate the policy.
In reality, what we need is the very opposite: we need the OPA engine to make all the potentially necessary “outgoing” (localhost) HTTP calls in parallel, so that, putting the very Rego evaluation aside, the time (latency) it takes to evaluate the policy is bounded by one “localhost round-trip”.
Thus, if the objective is to make the policy decision in one network round-trip, the entity that OPA talks to should be able to retrieve all the relevant data in one call. This is, of course, possible, but, again, such a solution inevitably spreads the business logic of policy application between the OPA/Rego code and the implementation of the component that OPA queries via HTTP.
Option 4: Have OPA return not a decision, but a query
This is the option we have evaluated the least, but it still deserves attention.
The idea is for the OPA call to return not a true/false decision, but the “code” of a predicate, which the caller of OPA then evaluates locally to arrive at the ultimate decision.
Simply put, the OPA engine could return a SQL query, which the caller itself then evaluates against its own local SQLite DB to compute the policy decision.
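The closest built-in mechanism appears to be OPA's partial evaluation: evaluate the policy with part of the input marked as unknown, and get back residual queries that the caller could then translate into, say, a SQL WHERE clause for its local SQLite DB. A rough sketch (the rule and field names are made up, and the Rego-to-SQL translation step is left out):

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/open-policy-agent/opa/rego"
)

// Hypothetical policy: access is allowed to resources the caller owns,
// or to public resources. (Classic Rego syntax.)
const policy = `
package authz

allow {
	input.resource.owner == input.user
}

allow {
	input.resource.public == true
}
`

func main() {
	ctx := context.Background()
	pq, err := rego.New(
		rego.Query("data.authz.allow == true"),
		rego.Module("authz.rego", policy),
		// The caller's identity is known; the resource attributes are not.
		rego.Input(map[string]interface{}{"user": "alice"}),
		rego.Unknowns([]string{"input.resource"}),
	).Partial(ctx)
	if err != nil {
		log.Fatal(err)
	}
	// Each residual query is a condition over input.resource only; the caller
	// would translate these into predicates over its own local data.
	for _, q := range pq.Queries {
		fmt.Println(q)
	}
}
```

For this sketch the residual queries come out roughly as conditions like input.resource.owner = "alice" and input.resource.public = true, which is the shape of “code” this option wants the caller to evaluate.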
While interesting in theory, this option suffers from all the shortcomings outlined above, most importantly code bloat and having the policy logic split across multiple places that use different programming languages. In this particular example, the schema of the SQLite DB is now part of the equation as well, which looks suboptimal, to say the least.
Conclusions
From what we know by now, it seems to me that OPA is great for use cases where:
The complexity of the policy rules is high, and the policies are dynamic.
The allowable latency for policy application is in the dozens of milliseconds.
Smooth service integration is important, thus the ease of natively guarding HTTP endpoints or Kafka topics with AuthZ policies with no code changes is a big plus.
Smooth identity provider integration is important, thus, being able to integrate into the JWT / PASETO / OAuth2 identities ecosystem is a plus.
The ultimate fleet deployment strategy is shared nothing, and it is acceptable to trade latency for isolation of microservices.
At the same time, OPA may be best avoided in scenarios close to our use case:
Extremely low latency is a crucial part of functional requirements.
The complexity of the rules is low, and we plan to keep it so by design.
The data needed to make AuthZ decisions does fit in RAM, but only if carefully laid out.
The most high-frequency users of AuthZ are the most latency-sensitive ones.
It would be great to get feedback from the people who have developed and/or have used OPA in the latter scenarios.
Resources
How Netflix Is Solving Authorization Across Their Cloud (slides, video).
Open Policy Agent at Scale: How Pinterest Manages Policy Distribution (video).
A good blog post outlining what OPA is, and what it is good for.
My experimental policies (one, two), and my code with lower-level comments.
I've gotten used to using an external service for AuthN, but I'm still on the fence about whether I want or need the complexity of external AuthZ. (Though my architecture still revolves around a single core monolith, despite having a dozen satellite micro and not-so-micro services.)
My main gripe with external AuthZ solutions is that they are engineered to be too generic and are unconcerned with space and resource consumption.
Have you considered one of the open source solutions 'inspired by Google's Zanzibar paper' as an extension to your Option 3?