This post is the second of two, and it is far more pragmatic than the first one. If you are interested in the underlying theory, please refer to part one.
The pragmatic part starts with a question that comes up time and again these days: how does a distributed system maintain its state?
Choreography and Orchestration
A distributed system is orchestrated if its nodes look up to a logically singular leader for instructions.
A distributed system follows the choreography principle if parts of it are designed to be capable of self-organization.
In simple terms, choreography-driven systems are built for resiliency, sacrificing quite a lot of maintainability for a bit of extra flexibility. This may come in handy in the aerospace industry, although given the latest developments of 2024 even this statement can be called into question.
For the other 99.99% of cases the potential gains that choreography offers are greatly outweighed by the downsides of the sheer complexity of building, debugging, maintaining, and extending such a system.
Adding functionality to a choreography-based system often requires re-thinking and re-factoring a lot of logic that does not immediately appear to be related to the very piece that is being altered. Therefore, very few real enterprise systems would benefit from preferring choreography over orchestration.
Other dichotomies between choreography and orchestration include:
The orchestration model is more imperative, “controlled from a single center”, while choreography allows for more dynamicity; some “local self-governance” if you wish.
The orchestration model conceptually assumes that the state of the distributed system is described explicitly, and the transitions of this state are individual pieces of code that can be reasoned about individually, in isolation. On the contrary, with the choreography model, even the very idea of reconstructing the state as seen at some particular moment in time can prove to be a more difficult problem than redoing the whole thing from scratch with an orchestrator.
With orchestration, the state of the system between major checkpoints can be followed linearly over time.
The happy path is often the same if we follow the orchestration and choreography implementations, but error handling, and, especially, restoring the system to its past, consistent state will differ drastically.
It should be noted here that the “single leader” in the orchestration model by no means implies a physically single leader; a physical single leader would be a single point of failure. In practice, the logical single leader is, of course, a small cluster of leader-elected nodes.
As we will get to soon, the very idea of orchestration is to explicitly limit the scope of the leader-elected part of the system to one task and one task alone: workflow orchestration.
Make no mistake: choreography is extremely fun to work on. In my humble opinion, it’s just not a good fit for the majority of tasks the software industry is facing these days. If I were to make a rough guess, I’d say that for every developer-hour where choreography is the best fit, there are thousands, if not tens of thousands, of hours where orchestration is just a safer, faster, and more straightforward way to build and ship what needs to be built and shipped.
It should also be noted that there does exist a middle ground: a choreography-centric approach that is designed to be easier to maintain by introducing traces of orchestration into it. In a nutshell, this is done by keeping a journal of all the commands the system has received, so that replaying and/or reverting atomic changes can be done in a more centralized fashion; see the sketch below. For common failure modes, such as individual nodes failing, this pattern makes the recovery logic easier. Rest assured, if you are already thinking along these lines, you do not need this post; for the vast majority of developers and architects out there my recommendation remains to stick with orchestration.
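For illustration, here is a minimal sketch of such a command journal. All names are hypothetical, and a production version would persist the log durably rather than hold it in memory:

```ts
// A minimal sketch of the command journal described above.
// All names are hypothetical; a real journal would be durably persisted.
type Command = { id: number; kind: string; payload: unknown };

class CommandJournal {
  private log: Command[] = [];

  // Record every command before it is applied anywhere.
  append(cmd: Command): void {
    this.log.push(cmd);
  }

  // Replay the journal against a fresh node, e.g. after a crash,
  // to bring it back to the last known-consistent state.
  replay(apply: (cmd: Command) => void): void {
    for (const cmd of this.log) {
      apply(cmd);
    }
  }
}
```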
Imperative vs. Functional
There are important similarities between:
going from stateful imperative to event-based and coroutine-based development, and
going from reasoning in terms of sagas and two-phase commits to reasoning in terms of workflow orchestration.
Everybody knows these days that working with non-blocking I/O is hard. Thankfully, most modern-day developers never had to invoke epoll() directly. What we have largely converged to using these days is the async-await paradigm.
If you have not used epoll() and can’t relate to its pains, here’s an example from the more recent past: the callback hell of node.js.
Before futures, promises, and async/await were widespread, we got quite used to dealing with functions whose last argument is the callback to be invoked upon completion. The convention, popularized by node.js, is for these callbacks to take two arguments, the error and the data. The immediate next layer of nesting checks the error first, then proceeds to work with the data if appropriate.
This programming style is distinctly characterized by using wide screens and a small font, since indentation quickly makes the innermost layers of called-back functions impossible to read without scrolling right.
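A compressed illustration of this style, with the I/O stubbed out by hypothetical functions that follow the node.js (err, data) convention:

```ts
// Hypothetical node.js-style APIs, each taking an (err, data) callback.
function readUser(id: string, cb: (err: Error | null, user?: { cartId: string }) => void): void {
  setTimeout(() => cb(null, { cartId: `cart-${id}` }), 10); // stubbed I/O
}
function readCart(id: string, cb: (err: Error | null, items?: string[]) => void): void {
  setTimeout(() => cb(null, ["apples", "pears"]), 10); // stubbed I/O
}

// Two nested calls already push the happy path to the right;
// real handlers would nest far deeper.
function handleRequest(userId: string, done: (err: Error | null, items?: string[]) => void): void {
  readUser(userId, (err, user) => {
    if (err) return done(err); // the convention: check the error first
    readCart(user!.cartId, (err2, items) => {
      if (err2) return done(err2);
      done(null, items); // only now do we get to work with the data
    });
  });
}
```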
An even more painful part was debugging. When stopped on a breakpoint, the debugger would show the stack trace of, well, everything above it. Some of this stack might be relevant, some might not, but, most importantly, not everything that was relevant would be there.
That’s the curse of asynchronous programming, which an entire industry was immersed into in the early to mid 2010s.
In fact, I would go on to make a stronger statement. The blessing of node.js was in exactly this: it highlighted how powerful, and yet how painful, it is to develop and maintain asynchronous code.
One could think that not having to worry about threads and locking would result in code that is easier to write and maintain. Alas, in the era of node.js, writing, and especially debugging clean and easy-to-follow code has proved to be a difficult challenge.
And it is exactly this challenge that has resulted in the now-widespread adoption of the async-await programming paradigm.
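To see the magnitude of the improvement, here is the same hypothetical handler from above, once its I/O is promisified:

```ts
// The same stubbed I/O as before, now returning promises.
const readUserAsync = (id: string) =>
  new Promise<{ cartId: string }>((resolve) =>
    setTimeout(() => resolve({ cartId: `cart-${id}` }), 10));
const readCartAsync = (id: string) =>
  new Promise<string[]>((resolve) =>
    setTimeout(() => resolve(["apples", "pears"]), 10));

// The nesting is gone, errors propagate as ordinary exceptions,
// and the debugger once again shows a meaningful "stack".
async function handleRequest(userId: string): Promise<string[]> {
  const user = await readUserAsync(userId);
  return readCartAsync(user.cartId);
}
```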
Let me outline the vision of what the next step of comparable magnitude looks like for our industry. This is the step towards orchestration engines, which should make a lot of logic easier to write and maintain, much like async/await helped us escape the callback hell.
A New Flavor
In a way, the above was the preparation for this most exciting part of the post. I would like to introduce it by referring back to the wonderful blog post What Color is Your Function? by Bob Nystrom.
Bob masterfully talks about having to color every single function into red and blue, plain and simple. No function is uncolored. And there are fixed rules to how the colors can interact.
Spoiler alert. The red functions are the asynchronous ones. And the rule is:
Blue functions can be called from everywhere.
Red functions can only be called from other red functions.
Bob then has a very specific subsection on why red functions are more painful to call.
He then half-jokingly presents the argument that, of course, we’d rather all use only the blue functions – they are more universal, after all! – but those language designers – no doubt, pure sadists! – have made some library functions red by design.
Bob’s post dates back to 2015, and it is understandable that back then some selling was in order. Not today though. If asynchronous library functions did not exist by design, good developers would have invented them by now! And this would have to do with ease of use and efficiency, not with torture.
Because when the logic of a “trivial” endpoint needs to go to the cache, then to the database, then journal some changes in the log, all while being a good citizen of the service mesh deployment, you really don’t want to be using blocking I/O. Your HTTP GET, your SQL request, and your Kafka publishing would all be async calls. And you’ll await their results before signaling to the caller of this endpoint that the logic can be considered done.
Some of those await statements would be there purely to ensure the order of execution. Others will take place inside the conditions of if-statements, so that responses from other components of your system affect the control flow of your async code.
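A sketch of such an endpoint, with each external dependency reduced to a hypothetical async interface:

```ts
// Hypothetical async interfaces for the dependencies named above.
interface Cache { get(key: string): Promise<string | null>; }
interface Db { query(sql: string, args: unknown[]): Promise<string>; }
interface Bus { publish(topic: string, event: string): Promise<void>; }

async function getProfile(userId: string, cache: Cache, db: Db, bus: Bus): Promise<string> {
  // An await inside a condition: the cache's response changes the control flow.
  const cached = await cache.get(`profile:${userId}`);
  if (cached !== null) {
    return cached;
  }
  // These awaits exist purely to enforce the order of execution.
  const profile = await db.query("SELECT profile FROM users WHERE id = $1", [userId]);
  await bus.publish("audit", `profile-read:${userId}`);
  return profile;
}
```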
Soon afterwards, for await became widespread as well, even though by that time the .NET Framework was already very particular about the difference between IEnumerable and IObservable, C# had yield return, and Python had generator functions with yield.
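In modern TypeScript terms, a minimal for await over a hypothetical async generator looks like this:

```ts
// A hypothetical async generator: values arrive over time, not all at once.
async function* pollUpdates(): AsyncGenerator<number> {
  for (let i = 0; i < 3; i++) {
    await new Promise((resolve) => setTimeout(resolve, 100)); // simulated latency
    yield i;
  }
}

async function consume(): Promise<void> {
  // `for await` suspends the loop body until the next value is ready.
  for await (const update of pollUpdates()) {
    console.log("got update", update);
  }
}
```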
Because, as virtually all developers know these days, allocating one OS thread per outstanding request is just not scalable once the parallelism factor is in the tens of thousands. And tens of thousands of concurrently serviced requests is not much at all, as long as your requests can run for several seconds each. Sure, you had better respond to the end user quickly, but the extra journaling steps, lazy cleanups, and potential compensating transactions, should stuff go wrong, may well live within the system for several seconds.
Add audit tasks, log rotation, metrics aggregation, and health checks, and even a service that handles some ~100 requests per second may well have to be “running” ten thousand concurrent operations. “Running” would be too strong a term though, as the vast majority of the time executing these tasks will be spent await-ing for other asynchronous operations to complete.
The point is clear. We have long since agreed that async/await is a first-class tool at our disposal. We have learned how to use it well. And we are not looking back to living without async/await.
This paradigm of using futures, promises, async/await, and green threads does not solve all the problems, but it solves most of them in its respective domain. And it fits what arguably is the most important success criterion: an otherwise-average developer can be trusted to write async/await code, and, more often than not, this code will be both correct and maintainable by other otherwise-average developers.
Behold: Maroon Functions!
A certain class of problems is underserved by the red, asynchronous functions. These are the functions where certain steps must take place no matter what.
They need a new, royal color, and I think maroon would do.
Red, cooperatively-multithreaded, stackless, resumable, asynchronous functions have cracked open the world of massive parallelism. They enabled us to do miracles with just a few human-readable lines of code.
Where red functions do need help is with persistent state.
If you could count on a machine that never fails (and never loses connection to the rest of the fleet), then the red functions would be sufficient to implement, well, almost everything we need today. But hardware is neither infallible nor indestructible.
Alas, when proper long-term state management is truly necessary, we are back to the world of manually persisting and restoring the state. And since persisting the state can itself fail, this world comes with its own two-phase commits and sagas and compensating transactions and a plethora of ways to shoot oneself in the foot.
Maroon functions fix exactly this: when it comes to maintaining the state, they are guaranteed to get executed no matter the interruptions. Even if this execution involves an arbitrarily large number of retries over a large number of nodes in different datacenters, continents, or even planets once we get there.
Much like with blue and red functions, it is not intended behavior for a non-maroon function to “make a call” into the maroon function and observe the result of its execution.
To be sure, a red function or a blue function could schedule a maroon function for execution, though I would suggest the terminology of initiating the maroon workflow. And surely, a blue function could await the result of this maroon function, if only by polling for some field in the database to change.
But that would likely be an antipattern. Workflows kicked off as maroon functions are best to be awaited for by other maroon functions.
The parallels between maroon functions and red functions are hard to miss. Let me illustrate them by example.
Execution Vehicle
Blue-Red
Unlike a blue function, a red function has no stack, and is suspendable and resumable.
Thus, a red function, even if the call to it originated from the blue function, can not possibly access the variables and objects that “belong” to this call-originating blue function.
In order for this red function to have access to those variables and objects, these variables and objects should be allocated on the heap, not the stack. They would also have to be captured by the red function so that they are not garbage-collected prematurely.
Effectively, for a blue function to pass some data to the red function, it creates a boxed-down envelope object, puts the data into this envelope, and passes the ownership of this envelope from the “blue runtime” to the “red runtime”.
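In JavaScript-family languages, closures perform this boxing implicitly; here is the pattern made explicit, with hypothetical names:

```ts
// The "envelope": heap-allocated state handed over to the red runtime.
// (In JS/TS, closures do this boxing implicitly; here it is made explicit.)
type Envelope = { requestId: string; payload: string };

function blueCaller(): void {
  const envelope: Envelope = { requestId: "r-42", payload: "hello" };
  // Ownership of the envelope passes to the "red runtime"; the blue stack
  // frame below may be long gone by the time the red function runs.
  queueMicrotask(() => {
    void redWorker(envelope);
  });
}

async function redWorker(env: Envelope): Promise<void> {
  await new Promise((resolve) => setTimeout(resolve, 10)); // simulated async work
  console.log(`red function done with ${env.requestId}: ${env.payload}`);
}
```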
Maroon
A maroon function not only does not have a stack. It does not even have a dedicated physical machine associated with its execution!
Creating an envelope by allocating an object on the heap instead of the stack is by far not enough for the maroon function to be able to access this object. For the maroon function to have access to some “variable”, or object, provided by the caller, this object must “live” outside the realm of a particular node. The “storage medium” of the “maroon-friendly” variable should transcend the notion of the node altogether.
Yes, a database. But a database with extra guarantees on persistence.
You guessed it, of course: this “maroon variable” would have to be persisted and maintained by the leader-elected distributed consensus engine.
That’s the beauty of maroon design starting to shine here. The leader-elected consensus engine is hidden from view, but to enable this we needed to carefully box down its responsibilities, to include what it must include and nothing else. And fusing together the execution layer with the state persistence layer is key.
Allowed Operations
Blue-Red
A blue function is synchronous in nature. It can not call a red function and wait for the result of this red function to become available. I mean, it can, but that’s suboptimal at best, since the OS kernel level thread would be blocked for an extended, potentially unbounded, period of time.
OS kernel threads are generally far more expensive than lightweight green threads. Properly await-ing on the result is a concept from the world of green threads and cooperative multithreading.
So, while blue functions can kick off red functions, if a blue function is to analyze the result of a call to a red function, this would have to be a second-order call. A common pattern is to think of some blue OnComplete() callback, the call to which will be kicked off by, drum roll, the red function itself.
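The second-order-call pattern in miniature, with stubbed I/O:

```ts
// A red function: stand-in for real asynchronous I/O.
async function redFetch(url: string): Promise<string> {
  return new Promise((resolve) => setTimeout(() => resolve(`body of ${url}`), 10));
}

// The blue callback that the red function will kick off upon completion.
function onComplete(body: string): void {
  console.log("blue world resumes with:", body);
}

// Blue code: kicks off the red function and returns immediately;
// it is the red side that later calls back into the blue world.
function blueInitiator(): void {
  void redFetch("https://example.com").then(onComplete);
  console.log("the blue function has already moved on");
}
```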
Maroon
Similarly, while a red function can initiate execution of a maroon function, simply await-ing on the result of its execution would be an anti-pattern.
The intuitive explanation is that a maroon function can literally take over a month to complete! Even if the framework that executes the originating red function could boast an uptime of over 30 days, I would bet that more often than not a new version of your code and your binary will be rolled out in that window, replacing the currently-running instance and thus losing the context.
This, of course, unless you use Erlang. Erlang does miracles. But this margin is too narrow.
Since the result of this maroon function executing is presumably business-critical, naturally, for this business-critical result to not get lost, it has to be another maroon function that handles the result as it is ready.
In particular, it immediately follows from the above that maroon functions:
should be deterministic,
should not use any node-specific logic,
should not reference wall time in any way, shape, or form,
and should not have side effects other than what another maroon function can handle.
For instance, to send an email in an exactly-once fashion, the “send email” API should properly support idempotency tokens. It will be the responsibility of the (distributed) execution runtime of maroon functions to generate this idempotency token as needed, and to keep it the same if and when the execution of this maroon function has to get transferred between nodes, AZs, DCs, or beyond.
To simplify this, maroon functions should only use maroon APIs, and operations with side effects, such as sending out an email or charging the user’s credit card, should be implemented as maroon APIs. Where exactly-once and ACID-ness is of essence, anything short of going “full maroon” re-opens the vicious cycle of having to deal with sagas, two-phase commits, and compensating transactions, which we have specifically decided should go away.
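A sketch of what such a maroon “send email” API could look like, with the provider’s deduplication stubbed out. The token handling and all names here are hypothetical:

```ts
// Stand-in for the email provider's server-side deduplication store.
const alreadySent = new Set<string>();

// Hypothetical maroon API: the side effect is keyed by an idempotency token.
async function sendEmail(idempotencyToken: string, to: string, body: string): Promise<void> {
  if (alreadySent.has(idempotencyToken)) {
    return; // a retry of an email that already went out: do nothing
  }
  alreadySent.add(idempotencyToken);
  console.log(`sending to ${to}: ${body}`);
}

// A maroon function never mints the token itself (that would be
// nondeterministic); the orchestration runtime generates it once and
// preserves it across retries, node failures, and migrations.
async function notifyUser(runtimeToken: string, to: string): Promise<void> {
  await sendEmail(runtimeToken, to, "Your order has shipped.");
  // A crash-and-retry of the same step reuses the same token:
  await sendEmail(runtimeToken, to, "Your order has shipped."); // still one email
}
```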
Communication Patterns
Blue-Red
Blue functions are synchronous. Unless the developer is in the mood to get into epoll()-style programming, effectively reinventing green threads on top of thread pools orchestrated by OS kernels, synchronous functions are meant to perform synchronous tasks.
And, seriously, we are not talking about developers who are in the mood to get into epoll()-ing stuff. Not having to use epoll() is effectively the reason behind the asynchronous, red, functions being invented in the first place!
So, if a blue function needs to take action based on the result of a red function, the implementation pattern would likely involve some [lock-free] queue inside the [virtual] machine, so that the respective blue function is invoked in its turn as new data is being published.
I thought long and hard about whether the previous paragraph should be rephrased for simplicity and clarity. But we live in 2024, and I trust the reader will not be put off by the notion of publishing into a lock-free queue that lives within a virtual machine!
There’s a SysDesign concept creeping straight into the realm of an individual binary. There’s decoupling and separation of concern built straight into the execution environment that contains both blue and red runtimes. Logically speaking, it’s not hard to imagine the physical machine as a mini-datacenter, where there are different CPUs running the blue functions and the red ones, with the communication between the blue and the red parts going through queues.
Within this mini-datacenter these “red CPUs” and “blue CPUs” can access the same memory and leverage the same cache. It’s a mixed blessing though, since sharing memory gets us back to the world of locking and mutexes that we are so longing to avoid.
On the bright side though, the “red CPUs” can safely assume that as long as they are alive the “blue CPUs” are alive as well, and vice versa. Having to worry about “parts of your CPU being down” from inside your own single-node code would be a nightmare for 99+% of modern-day developers.
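A toy version of the intra-machine queue mentioned above, taking advantage of JS being single-threaded (a real one would be lock-free and multi-producer):

```ts
// A toy in-process queue between the "red" and "blue" runtimes.
// A real one would be lock-free; in single-threaded JS an array suffices.
const completions: string[] = [];

// Red side: publishes its result into the queue instead of calling blue directly.
async function redTask(name: string): Promise<void> {
  await new Promise((resolve) => setTimeout(resolve, 10)); // simulated work
  completions.push(`${name} finished`);
}

// Blue side: invoked in its turn, decoupled from when the red tasks ran.
function blueDrain(): void {
  for (const msg of completions.splice(0)) {
    console.log("blue handles:", msg);
  }
}
```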
Maroon
A maroon function can, and often does, publish its execution result into some pub-sub bus as an event, so that various red functions can consume those maroon-completion events and act as they please.
This is exactly where the publish-subscribe paradigm begins to shine at its fullest, since interfacing back from maroon functions to red (and blue) functions inherently involves decoupling the maroon task completion event from whatever code will eventually consume it.
Hence I do not like to refer to maroon functions as being “called”. Rather, maroon functions represent workflows, or simply tasks, that are being executed.
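A sketch of this completion-event hand-off, over a hypothetical in-process bus:

```ts
// Hypothetical pub-sub bus: maroon completions are events, not return values.
type Handler = (event: string) => void;
const subscribers = new Map<string, Handler[]>();

function subscribe(topic: string, handler: Handler): void {
  subscribers.set(topic, [...(subscribers.get(topic) ?? []), handler]);
}

function publish(topic: string, event: string): void {
  for (const handler of subscribers.get(topic) ?? []) {
    handler(event);
  }
}

// A maroon workflow does not "return" to its initiator; it announces completion.
function onMaroonWorkflowDone(workflowId: string): void {
  publish("maroon.completed", workflowId);
}

// Any interested red (or blue) function consumes the event and acts as it pleases.
subscribe("maroon.completed", (id) => console.log(`workflow ${id} is done`));
```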
For an analogy, a Github action does not “return” a value upon success. It can do a lot of useful things, from pushing artifacts to sending emails and triggering other workflows.
It would be absurd though, while not impossible, to have a, say, Golang service kick off a Github action and await its execution. In fact, one could imagine a scenario where such a Golang app would be a preferred solution at a certain stage of a certain company, albeit not in the long run.
But, more often than not, thinking of Github actions as regular “functions” that “return” values would be an antipattern to avoid, for the trivial reason that, not being tied to a particular node, Github actions are simply more reliable than the node that would trigger them.
Therefore, for the overall design to be robust, if something needs to trigger a Github action and then somehow react to its completion, this something had better be a Github action itself. Behold: this is fair and square the concept of maroon functions as I am introducing it!
Don’t You Love it when a Plan Comes Together?
As if the above is not mind-blowing enough, let me dare to leave my dear readers with two more ideas:
Maroon Threads
Similar to green threads, we could now talk about maroon threads: a “boxed down” execution environment for a particular function.
This boxed-down environment is mostly expected to run a certain piece of logic locally and quickly. But at the same time it is guaranteed to persist the state of this execution elsewhere, so that if and when this local-and-quick execution is rudely interrupted by, say, a machine dying, the state of the execution is not lost forever, but promptly restored on another machine.
Broadly speaking, this idea is the main concept behind the paradigm of Event Sourcing. The lighter version of Event Sourcing, the Listen to Yourself paradigm, also constitutes a major step in this direction.
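In event-sourcing terms, a maroon thread journals each transition before acting on it, so a replacement node can pick up exactly where the dead one left off. A minimal sketch, with the journal’s durable persistence assumed rather than shown:

```ts
// Event sourcing in miniature: the progress of the thread is never stored
// as mutable state; only the journaled transitions are, assumed durable.
type Transition = { step: number; effect: string };

class MaroonThread {
  // The journal is persisted elsewhere (by assumption); we only hold a handle.
  constructor(private journal: Transition[]) {}

  // Resume on any node: steps already journaled are not re-executed;
  // execution continues from the first un-journaled step.
  run(steps: string[]): void {
    for (let i = this.journal.length; i < steps.length; i++) {
      this.journal.push({ step: i, effect: steps[i] }); // journal first...
      console.log("executing:", steps[i]); // ...then act on it
    }
  }
}
```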
Workflow-Oriented Programming
Also, I would like to draw an analogy between the paradigm of thinking in blue-red-maroon functions and the original idea behind the paradigm of object-oriented programming.
At the origins of OOP there were objects that communicate by messages sent from one object to another. That’s a perfect way to reason about the logic of an application.
Then the industry quickly scoped this very idea down to { public, protected, private } and { inheritance, encapsulation, polymorphism }, corrupting the core idea. These days, when I say Modula-2, Objective-C, and Erlang are the languages that still keep the OOP paradigm somewhat alive, most people look at me as if I’m crazy, because everybody knows it’s the magic word class that makes the code OOP!
But during this downhill transition a major scope-down went largely unnoticed. With the emergence of ORMs and with the “clear” separation of code and data, we now assume that the presumably-perpetual “objects”, in the OOP sense of this term, are instead ephemeral.
As in, our objects are not something brick-and-mortar. Instead, objects are what “lives” fully within one machine, one node, one execution environment. If this node dies, the “object” dies with it.
A well-designed ORM will then do its best to reconstruct “the same” object from the DB, so that business logic rules can be applied to this object as if nothing happened. But this is exactly the place where most issues arise: the “best effort” of the ORM is hardly good enough to provide guarantees about this object’s state, and the discrepancies between the true object and its reconstructed incarnation tend to accumulate and backfire.
In a way, thinking in maroon functions restores the long lost order in the now-chaotic universe of software engineering. There exist objects. Each object has a lifetime. Objects can interact with one another. Programming, as in, codifying the rules of these interactions is exactly what business logic developers do.
Objects are, of course, persisted somewhere, but the better the abstraction layer is designed, the less should the business logic developer concern themselves with where and how this persistence takes place.
All in all, it’s time we transcend the notion of pushing data to code or pushing code to data. For cleaner separation and decoupling, and for our core business logic code to be more maintainable and less buggy, we should transcend the notion of the application server(s) that executes the code and the database shard(s) where the objects are persisted.
Rather, we should think of mutations of objects by means of our business logic code as atomic, self-contained, inseparable operations. And we should choose frameworks that provide enough ACID-level guarantees that the particular business logic mutation executes in full or does not execute at all.
Yes, this includes cross-shard, cross-domain, cross-org, and even cross-company mutations. We have long developed the theory behind the means to implement such mutations clearly, correctly, and efficiently. We just never had the patience to internalize this approach.
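To make it tangible, here is a toy sketch of what such an atomic mutation could look like to the business logic developer, with an in-memory stand-in for the real, possibly cross-shard, engine. All names are hypothetical:

```ts
// Hypothetical framework API: `mutate` promises all-or-nothing semantics.
// This in-memory stub stands in for an engine that could span shards,
// domains, or even companies.
interface Account { balance: number; }

const world = new Map<string, Account>([
  ["alice", { balance: 100 }],
  ["bob", { balance: 0 }],
]);

async function mutate(keys: string[], m: (objs: Map<string, Account>) => void): Promise<void> {
  // Run against a copy; commit only if the mutation completes without throwing.
  const snapshot = new Map<string, Account>();
  for (const k of keys) snapshot.set(k, { ...world.get(k)! });
  m(snapshot);
  for (const [k, v] of snapshot) world.set(k, v); // the "commit"
}

async function main(): Promise<void> {
  // Business logic: a cross-object transfer, written as if it were local code.
  await mutate(["alice", "bob"], (objs) => {
    objs.get("alice")!.balance -= 25;
    objs.get("bob")!.balance += 25;
  });
  console.log(world.get("alice"), world.get("bob")); // { balance: 75 } { balance: 25 }
}

void main();
```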
Of course, the “logical” “objects” “live” in persistent storages. In the perfect world I am envisioning though, it is absurd to think that an object A, “living” inside some PostgreSQL database, could “send a message” to an object B that “lives” in some Redis cluster. Utter such blasphemy and HR will promptly escort you out. Business logic developers in our futuristic company do not have to think about where various objects live, much like they do not think about whether their publish-subscribe bus is even Kafka.
Business logic developers should leave these tasks to SREs, DevOps, DBAs, and DBOps, and focus on what they do well: the core business logic of the application. That’s what proper decoupling should bring into the world of software engineering, and this is what the concept of maroon functions enables in full.
We need to tweak our minds towards a new, cleaner, and simpler way of thinking. And maroon functions, and the overall “maroon threads” and “maroon runtime”, are offering us exactly this opportunity.
Decentralized Deployments
This post would be incomplete without mentioning some of the most interesting trends in the decentralized / Web3 development.
To give credit where credit is due: it was in the context of Web3 that I first heard the idea of many interconnected machines all evaluating the same function.
Non-Web3 developers these days are familiar with Firebase and with Mongo. There are far more products in this space. What they offer is the abstraction of some storage layer that the user does not have to worry about.
Developers can write their code and test it locally, against a mock instance of the “toy database”. Their tests will be run by a Github action against a remote, cloud, staging or canary environment. Then, once pushed to production, the code will use the real database. These three setups may well be very different in scale and resiliency, but the business logic code remains the same. It’s the very fact that the code remains the same that has drawn many developers to this paradigm.
However, both Firebase and Mongo do draw a thick line between the storage layer and the application server that runs the business logic.
With the idea of maroon functions taking over, I can envision the world in which the code that developers write is agnostic to whether their { code + data } combo runtime exists:
Locally, on a developer’s machine,
In a test cluster, run by a Github action of some kind,
In a proprietary production datacenter, with disk drives physically protected from unauthorized access, or
In a decentralized, Web3 setting, where the concept of physical protection of user data is substituted by homomorphic encryption and zero-knowledge proofs.
Simply put, the “cloud development setup for the future” may well involve the developer of a small app “paying” some trivial amount of money to “rent” a storage + compute environment from the “cloud” of like-minded individuals offering their own computers as nodes.
If you are familiar with Ethereum, it would be a decent analogy to think of designing your business logic in terms of having to pay gas fees for their execution. These gas fees would need to include hot and cold storage, but these details would quickly get me into another long post.
Your company then has a lot of flexibility:
It can go “into the cloud”, renting storage and compute from AWS or Azure or GCP to run its business cost-effectively,
It can leverage the decentralized setup of its product, knowing full well it will then be independent of any big players’ policy changes,
Or it can opt for a hybrid approach, such that it can be decided per customer whether they prefer to get locked into some Big Co’s cluster, or to stay always-on elsewhere.
This is true hybrid cloud computing to my taste. Instead of choosing between spot instances vs. provisioned ones, think of perfectly elastic horizontal scaling on any hardware as the baseline to improve upon.
Get provisioned machines on Hetzner if you please, where they are just as reliable yet cost less to run. Or even incentivize people all around the globe to run your code on their nodes by offering them payments in whatever means they prefer, proportional to their contribution to keeping your product up and running. Maybe some of your paying customers would in fact prefer the latter, while enterprise-grade users would need regulatory compliance with local data residency laws. Your business logic code remains the same. And your developers all build for and test on a single setup. It’s the execution layer that changes, and this execution layer is what most developers choose not to see and do not have to think about.
For an average developer on your team, running on a decentralized/hybrid approach for some customers is no different than, say, moving on with a different JVM provider. Some minor changes may take place, and some specific bottlenecks in performance may emerge. But the majority of the code stays the same, and keeps exercising the expected, correct behavior. Developers can by and large ignore what happens behind the scenes and focus on their jobs. DevOps, on the other hand, can and should abstract away from what the business logic code does and how it is written.
Similarly, in your own datacenter, do not think in terms of Kubernetes workflows and cross-AZ / cross-DC migrations. Instead, think of dedicating a certain amount of storage + compute to the Maroon Execution Engine itself, and have the team of Maroon SREs, Maroon DBAs, and Maroon DevOps decide how to best allocate the rest.
This seemingly impossible future is extremely easy to imagine once the business logic is implemented in maroon functions. That is, once the implementation of the logic of your business is free from impurities such as nodes, DB shards, data locality, and other mundane details. Instead, think in vector clocks for causality, think in strong vs. eventual consistency for the write path and for the read path, and think of orchestrated workflows for what the functional requirements say should be ACID and/or exactly-once.
Closing Thoughts
By now, I have dedicated several years of my career to architecting consistent, reliable, low-latency systems. I have built and maintained them myself, I worked closely with various teams in different roles, and I have overseen the processes top-down.
All in all, I’m probably well-qualified to say that my view of the world of designing, shipping, and maintaining large systems is extensive and holistic.
And for the past several quarters my thought process keeps coming back to the simple fact that we are not thinking in terms of stateful workflow orchestration nearly as much as we ought to.
In my formative years as a senior engineer turning architect, I vividly recall how developers suffered from:
Having to work with epoll() before green threads were invented.
The callback hell before widespread adoption of async/await.
Riding this train of thought into today, I see how developers struggle with:
Long polling, since proper push APIs have not yet taken over.
Back-pressure issues, as the industry has not yet mastered proper decoupling.
Lack of total order and causality, since instead we have divorced topics & partitions.
The more I am thinking of the above, the better the picture of the future of engineering of large-scale distributed systems is taking shape in my mind.
And it is the picture where stateful workflow orchestration plays the central role.
Hope you will grow into this picture too. So that together we can make software engineering as fun and exciting, and as effective and productive, as it was when I started in the field 20+ years ago!
Thank you for reading.