Distributed Stateful Workflows
The theory and the concepts behind modern-day distributed software.
This is the first of two posts on distributed stateful workflows.
It focuses on the theory and sets the stage for the second part: the implementation.
Topics covered in this post are:
Total and partial order of events and data mutations in the system,
Distributed consensus and intuitive ways to leverage it,
Exactly-once, at-least-once, at-most-once, and a bit of the CAP theorem, and
Last, but not least: a holistic yet pragmatic way to reason about the above.
Personally, I believe in the fundamentals, and that internalizing the concepts is superior to knowing the facts. For readers who prefer the pragmatic approach though, the second post may well be a better starting point. You can always return to the first one at a later time.
Intro
At the highest level, SysDesign is about building systems that behave predictably and consistently, while the data they manage is located on multiple nodes around the globe.
The ultimate challenge is for this distributed system to automatically withstand node failures, as well as broken cross-node or cross-continent connections.
A litmus test of a good system is that if the nodes on a particular continent are down, or unreachable, or misbehaving, the system continues to operate as if nothing happened. What brings the architect joy are production-grade systems for which continuing to fulfill their functional requirements under such an extreme scenario is part of a routine operational procedure.
Granted, “routine” is a bit of an exaggeration here. The DevOps team most definitely is paged if the externalities are as bad as an entire continent losing connectivity, or misbehaving in unpredictable ways. For the avoidance of doubt, let me loosen the statement from the previous paragraph.
It is absolutely essential, of course, that the on-duty engineer is duly aware of the partial outage of the underlying hardware machinery should it be taking place. The true requirements for the system could then be formulated along the following lines:
If the total number of properly running nodes with well-functioning connectivity between them is below what is needed for complete operation (say, ~50 nodes), the system enters the planned partial degradation mode.
During the partial degradation mode:
The READ, immutable, queries are allowed to return stale data. In other words, read replicas can be dozens of seconds or even minutes behind, as opposed to the sub-second lag expected under normal operation.
The WRITE, mutating, requests can be substantially throttled, so that some or all users may be temporarily unable to perform non-read-only activities. More likely than not, these actions are allowed to be severely slowed down, or limited in scope to some basic operations only.
The system should suffer no data loss and no data corruption as long as some basic set of nodes remain available and can be relied upon.
The basic setup is well-documented in the set of functional requirements, and could be as trivial as the following.
Problem Statement
Without going deep into real-life numbers, let’s consider the following hypothetical set of requirements for a production system:
It should be alive in at least three regions / availability zones.
In each of them the machines should live on at least three different racks / network switches / independent power circuits.
For such a setup:
The system should be self-healing.
Once the connectivity between a sufficient number of well-functioning nodes that can successfully talk to each other is restored, the system should automatically resume its operation at full capacity.
Elasticity, as in scaling the number of nodes up and down, is the same standard operating procedure as the bullet point above describes.
In other words, scaling up by adding nodes is as routine an operation as managing not to misbehave if and when some recently-up-and-running nodes go under.
And, of course, the DevOps team has full visibility into the internal workings of the system at any moment in time.
Down to fine-grained hints along the lines of “we lost this AZ and/or DC”, or “we need to add at least M nodes in a certain location, or at least N nodes in a new location, since otherwise the currently-down part of the system will not be back in operation”.
An important side note to be made here is that this post does not cover Byzantine consensus. The functional requirement for the system described above is that it should handle failure scenarios that can be reasonably articulated to be organic. The problem becomes substantially harder if what the system has to be able to deal with is an intelligent well-informed attacker that is sufficiently aware of the internals of the behavior of the system itself, so that this attacker can exploit the corner cases of this system’s architecture and its implementation. This facet of the problem, however interesting, is a separate beast altogether, and this margin is too narrow to even begin the appropriate conversation.
It should also be noted here that “simple” cases of DoS and DDoS attacks do not qualify as Byzantine failures, and are instead covered by the functional requirements for the system discussed in this blog post. For a potential attack to truly be a Byzantine one, in the modern-day Web3 definition of the word, it should not just involve multiple machines spamming the system with requests, and not just send the “worst possible” requests in the “worst possible” shape and form, but also be able to adjust its “attack traffic patterns”, as well as the “attack failing nodes set” and the “attack failing connections”, in an intelligent way that has full visibility into the internal workings of the system and into how it can be exploited.
To recap, the system design principles outlined above do result in a system that organically handles the “basic” DoS and DDoS attacks well, although quite some extra work will be required if the system has to be Byzantine-fault-tolerant.
Requirements
For the sake of completeness, another note to make is that the modern-day terminology is rather vague when it comes to the boundary between functional and non-functional requirements in the domain of highly scalable distributed systems. Since I am well aware many people are reading my posts with interview prep in mind, let me do my best to keep this uncertainty to the minimum.
In my book, when in doubt, if a requirement is qualitative it should be a functional one, while if it is quantitative it may be a non-functional requirement. Here, the words “should” and “may” are used exactly as outlined in RFC 2119.
Therefore, based on the example reference requirements outlined above:
Whatever has to do with numbers, such as “at least five nodes in at least each of three AZs per a DC” is closer to a non-functional requirement, while
Whatever has to do with respecting correctness-related invariants is closer to a functional requirement.
Do keep in mind though that your mileage may vary depending on the context of the conversation you’re finding yourself in.
When interviewing for a DevOps/SRE position, the point where non-functional requirements turn into functional ones is where the “minimum number” of racks and/or datacenters is defined, such that a “healthy enough” system does not require human intervention, while in a “bad enough” state it is acceptable to expect human action to kick in.
In practice, the rule of thumb is clear though: look at what’s dear to the heart of the person you are speaking with, and do your best to understand their own success criteria. It’s easier said than done, but it is a skill that can be learned. Forewarned is forearmed; now let me proceed with a more academic outline of the core principles and their practical applications, leaving the conversational soft skills aside from now on.
Developer Experience
In my day job, I work elbow-deep in this domain and this scope of problems, since it excites me, and overall I find it fascinating. I write about it quite a bit too, and every time I attack the problem from a different direction.
The aim of this blog post is to uncover the problem from the developer perspective.
I want to talk about, and speak to, the developers whose day job is to implement business logic so that it suits the above functional requirements. This post then is:
about the tech debt these developers – us! – inevitably create along the way,
about putting the ceiling to this tech debt, so that, on the scale of the entire engineering organization, this logic-maintaining machine does not choke to a halt under the weight of its own complexity,
about ways to untangle this complex web of seemingly inter- and cross-dependent problems that inevitably arise when the system is distributed, and, most importantly,
about the best practices we, the industry, have uncovered along this journey that we all took on collectively.
By the end of this post, or rather by the end of the second post of the two, I expect readers across the entire range of experience, from sub-senior to uber-senior, to learn a trick or two that will help them:
tell the easy problems from the hard ones,
articulate this difference efficiently, and
propose pragmatically-effective solutions where the proper point of view is half the outcome.
The rest of this post focuses on what this proper point of view may well look like.
Side note: A good friend of mine once shared privately that his performance review read “brutally effective”. His role at the time was some senior++ DevOp, and the year was ~2017. Personally, I am very much in favor of brutal efficiency, although, again, individual career trajectories may vary, yours truly is not a “lawyer”, and this post is by no means “investment advice”.
Back to developers working on the ever-increasing-in-complexity business logic under the shroud of future uncertainty. The ugly truth is that the reality is quite different, and, for the most part, far more prosaic than a decently-sized blog post can cover.
Most people whose job is to maintain and extend the business logic of a highly distributed consistent system have their brains wired just like every other developer’s. People tend to avoid complexity when they see it, and, lacking reliable goalposts, their thoughts tend to cluster around various “centers of crystallization”, since those are the safety pockets in uncharted waters.
Simply put, two-phase commits, saga-s, leader election, orchestration, etc. are the areas where code is difficult to review and where standard approaches are suboptimal and error-prone.
You are more than half-way there when you can identify these patterns, ever-present in the code directly and indirectly, and when you can reason about their value before these code patterns have fully taken root.
And if this is exactly what you find your team struggling with, this post is for you.
SQL and ACID
Academic detour spoiler alert. To cut to the chase, thinking about how to simplify those saga-s and two-phase commits quickly gets us to languages and frameworks.
Like it or not, SQL is quite a good language to define ACID transactions in. Note how carefully I am phrasing this simple statement though:
ACID means exactly what it stands for: Atomic, Consistent, Isolated, and Durable data mutations. Martin Kleppmann suggests rebranding the “A” as “Abortable”, and personally I like it, but that’s beside the point.
We could also argue that “Durable” loses its original meaning in the world of highly-distributed data that is often in-memory source-of-truth’d on a particular node, since waiting for the data to be flushed to disk is both unnecessary and inefficient, but the Atomicity, Consistency, and Isolation requirements would still stand strong.
To make it super clear: the ACID requirement most definitely qualifies as a functional one in my book; if a system to be designed can afford to be non-ACID, quite a lot of what this post is about can (and arguably should!) be viewed as over-thinking and over-engineering.
Transactions are conditional collections of trivial mutations of various data entities that do not make sense when applied individually, but only when treated as atomic compound mutations.
In other words, no observer should be able to logically observe an "in-between" state of the data.
For example, moving a user from one user group to another implies that this user is always observed as a member of one and only one group, never of none or both of them. Similar guarantees should hold for group member counters and other related invariants.
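To make the invariant above tangible, here is a minimal sketch with a hypothetical schema, using SQLite via Python purely for illustration: the three mutations either all become visible or none of them do.

```python
import sqlite3

# Hypothetical schema: memberships(user_id, group_id) and groups(id, member_count).
conn = sqlite3.connect(":memory:", isolation_level=None)  # we manage transactions explicitly
conn.executescript("""
    CREATE TABLE groups (id TEXT PRIMARY KEY, member_count INTEGER);
    CREATE TABLE memberships (user_id TEXT PRIMARY KEY, group_id TEXT);
    INSERT INTO groups VALUES ('group_a', 1), ('group_b', 0);
    INSERT INTO memberships VALUES ('user_42', 'group_a');
""")

conn.execute("BEGIN IMMEDIATE")  # start an atomic, isolated transaction
try:
    conn.execute("UPDATE memberships SET group_id = 'group_b' WHERE user_id = 'user_42'")
    conn.execute("UPDATE groups SET member_count = member_count - 1 WHERE id = 'group_a'")
    conn.execute("UPDATE groups SET member_count = member_count + 1 WHERE id = 'group_b'")
    conn.execute("COMMIT")    # all three mutations become visible at once...
except Exception:
    conn.execute("ROLLBACK")  # ...or none of them do; no in-between state is ever observable
    raise
```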
At the same time, yours truly has witnessed trading (!) systems where the ultimate balance reported can result in some positions being double-counted or not counted at all, if the execution of a particular order happened to coincide in time with taking the snapshot for the report. Needless to say, this is not the architecture I’m advocating for. Although, again, one can argue that if those reports are never funneled into the decision-making process and are meant for DWH/analytics consumption only, the cost of such a mistake occasionally creeping in is indeed negligible.
Finally, I'm talking about SQL as a language.
The very term Structured Query Language is extremely bloated these days.
Even though the “L” for “Language” is one-third of it, quite often in the meanings-embedding space, what people mean when they say SQL is closer to an RDBMS, a Relational Database Management System.
In today's world of convergence and proliferation of various technologies, the term "relational" can be legitimately applied to virtually any database, much like a good used-tire salesman will confidently call any tire a winter one as they see fit.
Are MongoDB or DynamoDB “relational”? Absolutely not in the traditional meaning of the term, but it truly can be argued the other way too these days, so I do urge you to be careful with making definitive statements.
The canonical opposite of a “relational database” is something close to a “key-value store”, where various keys are truly independent. But, with most “key-value stores” introducing some concept of transactions understood as atomically grouped together mutations of various keys, the line truly is blurred, and the difference is more quantitative than qualitative. As in, we need to look at development and maintenance and run costs, since practice trumps theory, and in practice most DBs and key-value stores can be more or less effectively used to solve most problems.
Paradigm
Back to following the theme of developer experience, the immediate next step in scoping down the problem is to eliminate any and all "solutions" that involve having to reason along the lines of:
Distributed state machines,
Two-phase commits,
Compensating transactions and saga-s,
And overall the concept of leader election.
A well-designed framework offers the developer an experience identical to as if they are working with a single-box indestructible logical node that is always online.
This is a bold claim to make. But it’s central to this post, and I intend to defend it.
Problem
Following the DevEx theme, I’d postulate the problem as:
How do we write and maintain the code?
Weak Argument
Starting with strawman arguments first, there is no “better” way to obfuscate a relatively simple piece of business logic code than to make the team implement it in a way that is multi-node-fault-tolerant!
All code is about maintaining the state. Fusing together the state that has to do with the business logic and the state that has to do with its internal representation is a recipe for write-only, unmaintainable code.
As most engineers will tell you, multithreading alone is a pain in the butt. In the trivial case, mutexes solve all the problems except, well, performance. Once locking individual resources is not enough and we have to go deeper, the surface area where a mistake can be made blows up as threads (both organic and green) begin to talk to each other. The likelihood of a trivial error being uncaught by unit tests and unnoticed during the code review phase grows exponentially.
This is doubly true when the code is altered and maintained some time after it was written. And quadruply true if the original developers have since moved on.
Yes, the above is a strawman argument, and strawman arguments are all in all not great. But I won’t judge you if you use it to defend your case! If your teammates and your manager have suffered enough to experience the above firsthand, such an argument may well work, and why invent advanced approaches if the straightforward path of delivering the message does the job? Especially if you did your homework, and can defend your case with real trump cards if this originally-strawman argument is countered early on.
So, the argument above is weak, but at times, especially when talking upwards to non-engineers, an emphatically presented weak argument is all you need. The “not an investment advice” disclaimer holds, but I truly mean this – some fights are best won by presenting a relatable yet technically suboptimal argument.
Strong Argument
Having presented the strawman argument, here is the real one.
Compared to multi-threading, multi-node logic is far, far more difficult. Once the internal representation of the state has to do with saga-s and compensating transactions, the development of core business logic effectively comes to a grinding halt.
With multi-threading, a relatively straightforward set of rules can be devised, so that as long as they are followed the developer for the most part is ensured to stay out of trouble.
Lock every individual critical resource using mutexes, use [lock-free] queues so that every object can only be accessed for one purpose at a time, and you’re 90% there.
When multiple resources can be locked simultaneously, think about the chicken-and-egg problem beforehand, and lock/unlock these multiple resources following some “natural order of hierarchy” to prevent deadlocks; this would take you 99% of the way there.
These two heuristics alone would generally result in safe, albeit not maximally performant, multithreaded code; a minimal sketch follows below.
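A hypothetical Python sketch of those heuristics, with made-up Account objects: both locks are always acquired in the same agreed-upon order, so two concurrent transfers in opposite directions cannot deadlock.

```python
import threading

class Account:
    def __init__(self, name: str, balance: int) -> None:
        self.name = name
        self.balance = balance
        self.lock = threading.Lock()  # heuristic #1: one mutex per critical resource

def transfer(src: Account, dst: Account, amount: int) -> None:
    # Heuristic #2: acquire the two locks in a globally agreed order (here, by name),
    # so that transfer(a, b, x) and transfer(b, a, y) running concurrently cannot deadlock.
    first, second = sorted((src, dst), key=lambda account: account.name)
    with first.lock:
        with second.lock:
            src.balance -= amount
            dst.balance += amount
```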
With multiple nodes in sight though, the very idea of a mutex is a non-starter. There simply exists no good and safe way to reliably lock something in a distributed setting.
What if a resource is locked by a node in an availability zone, or in a datacenter, that just went offline? Maybe it will never come back, so there should absolutely be a way to recover the state in order not to sacrifice the self-healing functional requirement. Such an approach immediately hints at some timeout-based logic. But what if that misbehaving machine is fully operational but just too slow to respond? Or maybe it did respond promptly and correctly, but it’s the communication layer that is degraded?
The machine itself may be stalled (garbage collection enters the chat!), or, for a trickier case, we could end up in a split-brain scenario where parts of the otherwise-fully-functional fleet could not talk to each other for a non-negligible amount of time.
And non-negligible here stands for “anything that can be plausibly perceived as too slow by the product owner”. For a case of real-time communication, for instance, a few extra seconds may already be “too slow”. In the aerospace industry, a few extra milliseconds may well be “too slow”.
I hope I have convinced you that the problem is not nearly as simple as it may look. And with this we’re done with the problem statement phase and are entering the solution phase.
Approaches
To prepare the reader for the second part of this post, the solution phase consists of three sub-chapters.
First I will introduce the terms Choreography and Orchestration (candidates, take note, these are big in interviews nowadays!),
Then we will talk about “coloring our functions”, ref. this wonderful post by Bob Nystrom, and finally
I will wrap this up by getting back to how it started: the problem statement for what is known as a workflow orchestration engine, and how Temporal attacks and solves this problem.
Spoiler alert, to set the table:
Yes, I do believe that, as of 2024 and onwards, when a non-core developer thinks of a “distributed lock”, or a “two-phase commit”, or the saga pattern, they are better off switching to thinking in terms of workflow orchestration instead.
Among other things this means that the “canonically correct” way to construct an ACID system out of non-ACID blocks should be exactly this: define a bunch of workflows that ensure ACID-ness (a minimal sketch follows below).
So, next time you're asked a nontrivial SysDesign interview question, such as "design a massively distributed relational database when the only primitive we have is S3 or Cassandra", I personally urge you to think from the workflow orchestration standpoint right away.
My firm belief is that, assuming you follow the train of thought of this blog post, such a mindset is a gift that keeps on giving, and it will benefit you many times over.
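To make the spoiler above concrete, here is roughly what such a workflow could look like. This is a minimal sketch loosely modeled on Temporal’s Python SDK; the debit/credit activities and account identifiers are hypothetical, and part two walks through the real machinery.

```python
from datetime import timedelta
from temporalio import activity, workflow

# Hypothetical activities: ordinary, individually non-ACID building blocks.
@activity.defn
async def debit(account_id: str, amount: int) -> None:
    ...  # call some external, non-transactional service

@activity.defn
async def credit(account_id: str, amount: int) -> None:
    ...  # call another external, non-transactional service

# The workflow reads like straight-line, single-box code; retries, durable
# intermediate state, and node failures are the orchestration engine's problem.
@workflow.defn
class TransferWorkflow:
    @workflow.run
    async def run(self, source: str, target: str, amount: int) -> None:
        await workflow.execute_activity(
            debit, args=[source, amount],
            start_to_close_timeout=timedelta(seconds=30),
        )
        await workflow.execute_activity(
            credit, args=[target, amount],
            start_to_close_timeout=timedelta(seconds=30),
        )
```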
Leveraging the Consensus
When it comes to keeping a distributed system consistent, the major challenge is maintaining consensus. Even setting the Byzantine aspect aside, a single node of such a system should always assume that:
Whatever it sends out may never reach the other party, or get there after an arbitrarily long delay, and
Whatever it receives may have been sent a long time ago, and
Locally-observed world time is just a poor proxy for cross-node synchronization.
The last point about time often gets missed.
Timeouts are generally okay, since they measure time intervals by subtracting one reading of the local clock from another reading of the same local clock. But this is as far as it should get. Calculating the “time difference” between two points in time as seen by two different nodes on two different local clocks is strictly discouraged, to say the least.
If you’re into strong typing, my recommendation is to treat time readings taken at different nodes as different types, which just can not be compared to one another whatsoever.
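If that sounds abstract, here is a hypothetical sketch of the idea: a clock reading carries the identity of the node that produced it, and cross-node subtraction is rejected outright.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LocalTimestamp:
    node_id: str   # which node's local clock produced this reading
    nanos: int     # the reading itself, in that node's own units

    def __sub__(self, other: "LocalTimestamp") -> int:
        if self.node_id != other.node_id:
            raise TypeError("refusing to compare clock readings from different nodes")
        return self.nanos - other.nanos

t1 = LocalTimestamp("node_a", 1_000)
t2 = LocalTimestamp("node_a", 3_500)
timeout_elapsed = t2 - t1                # fine: an interval measured on one local clock
# LocalTimestamp("node_b", 2_000) - t1   # raises TypeError, by design
```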
Now, we are talking about a system in which multiple nodes should somehow coordinate. If they do not have to, then we’re not dealing with a distributed system to begin with.
Deriving Orchestration
The landscape of the problem domain of coordinating nodes is well mapped by now.
On the lowest level we have the core concepts such as:
Causality, specifically, Vector Clocks, and
Leader election algorithms, Raft and Paxos in particular.
Vector clocks are the idea that while absolute time readings from different nodes can not be compared to one another, they can be compared to time readings from the same node at some different point in time. Without loss of generality, people often talk about “epochs” or “indexes”. It just often proves to be easier to replace absolute wall time in, say, seconds since midnight of January 1st, 1970, UTC, by a monotonically increasing index generated by each individual node. And, generally, there’s a finite, well-determined set of nodes that have contributed to executing a particular piece of code. The vector clock is simply the idea that { node_a: 10, node_b: 15 } took place after { node_a: 3, node_b: 5 }, while we can not compare { node_a: 3, node_b: 8 } and { node_a: 7, node_b: 4 }.
To add beauty to this idea, if a node fails and is re-started, it may well begin its own new logical clock under a new name. So, the system would not have “node_a” and “node_b”; it would have “node_a_{some_random_hash}” and “node_b_{some_random_hash}”. It is not too much of an added constraint that we expect no request to have been processed by “node a” both before and after it has failed and was then re-started. Chances are, we want this request to be handled by some other node far before “node a” is back in business.
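In code, the comparison rule behind those examples is tiny; a hypothetical sketch:

```python
def happened_before(a: dict[str, int], b: dict[str, int]) -> bool:
    """True iff `a` precedes `b`: no component of `a` exceeds the matching
    component of `b`, and at least one component is strictly smaller."""
    nodes = set(a) | set(b)
    no_greater = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    some_smaller = any(a.get(n, 0) < b.get(n, 0) for n in nodes)
    return no_greater and some_smaller

assert happened_before({"node_a": 3, "node_b": 5}, {"node_a": 10, "node_b": 15})
# Neither precedes the other: the two events are concurrent.
assert not happened_before({"node_a": 3, "node_b": 8}, {"node_a": 7, "node_b": 4})
assert not happened_before({"node_a": 7, "node_b": 4}, {"node_a": 3, "node_b": 8})
```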
Leader election is the idea of taking individual nodes’ vector clocks to a higher level. In theory, once the concept of vector clocks is internalized, it is a first-principles building block, and virtually any coordination problem can be derived from it. In practice though, do you want your developers to think of vector clocks when deciding which cache node to hit? No, not really. You want to have some “intelligent router” that would abstract the intricacies of failing nodes and subsequent cache warm-ups away from the end user who just needs to hit the cache for a given key.
Now, say you are in charge of developing this intelligent router that maps parts of the key space to nodes. The logic concealed within this intelligent router would have to handle the case of a cache node failure. Who would handle the case of the very router failing though?
In the world of high availability, of course, there is more than one node of this intelligent router. But these intelligent router nodes need to coordinate with each other. Maybe for the problem of caching in particular it’s no big deal if once in a few million or billion requests they disagree, and some user request hits a stale cache node. Overall though, especially when dealing with financial data, such a mistake is not merely costly, but outright unacceptable.
So, our intelligent routers would need to have a layer of intelligent routers on top of them. It’s turtles all the way down, and this very “intelligent routing for the intelligent routers” is, in a nutshell, what the leader election system does.
From the perspective of a system designer, one can think of a leader-elected cluster as a slow yet strongly consistent key-value store, where mutations are totally ordered: if a caller has sent an update request for a particular key, and the value for this key was successfully updated, then no other client will see the old value for this key when they query it. Other clients may see a newer value for this key, and, in fact, the value for this key may well have been overwritten by the time the successful response has reached the original updater. Moreover, the “newer” value may happen to be the one that was sent by some other node a bit earlier, but somehow it was processed by the leader-elected cluster sooner. This is totally acceptable, since, keep in mind, no absolute times from different nodes can be compared to one another.
And for every two mutations the leader-elected key-value store knows exactly which one happened before the other one. In a leader-elected state machine, the mutations are totally ordered.
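For illustration only, here is a toy model of that contract; everything is hypothetical and runs in a single process, where a real cluster would run Raft or Paxos to agree on the order of the log.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Put:
    key: str
    value: str

class TotallyOrderedKV:
    def __init__(self) -> None:
        self.log: list[Put] = []          # what consensus actually agrees on: the order
        self.state: dict[str, str] = {}   # derived by replaying the log

    def commit(self, mutation: Put) -> int:
        # In a real cluster, this is the line where Raft/Paxos assigns the slot index.
        self.log.append(mutation)
        self.state[mutation.key] = mutation.value
        return len(self.log)              # the mutation's position in the total order

    def read(self, key: str) -> str | None:
        # May be "newer" than what the caller expects, but never older than a
        # successfully acknowledged write.
        return self.state.get(key)
```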
Why maintaining a consistent key-value store and maintaining the total order of mutations are two difficult but equivalent problems goes beyond the scope of this already long post. For curious readers, I recommend Kleppmann’s book, blog posts, and tech talks.
Last but not least: unless you truly know what you are doing, I would caution against designing the leader election engine yourself. Without going into too many details, the problem is theoretically unsolvable in the “correct” way. In other words, for any algorithm there will exist a sequence of “most inconvenient” failures, in nodes and/or the connections between them, that will make this consistency engine never converge to a consistent state.
To make it clear: such a system would not become inconsistent, it would simply be unavailable, with no new mutations accepted by it. However, there exist plenty of practical algorithms for which the probability of such a stalled state rapidly approaches zero as time goes on. The problem is solved in a very pragmatic way, where pragmatism manifests itself in low-level practical knowledge of which patterns of machine and network failures are more likely to occur together, and which other failures can be viewed as statistically uncorrelated, so that they can be used to hedge against one another.
The bottom line is: if you need a consistent system, most likely, you want to find a creative way to leverage an already-existing leader election engine, so that even during the “turbulent times” of cascading failures, the number of “world-totally-ordered” mutations per second is rather low, while the acceptable latency is relatively high.
For the record, this is exactly what the consistent hashing idea does so well: it only requires one mutation added to the leader-elected storage per its node going up or down, leaving the rest to the cache engine itself. Compared to having a non-zero fraction of cache invalidation requests hit the leader-elected storage, this is a huge win.
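A toy version of that idea, with hypothetical names and no virtual nodes: only add_node and remove_node correspond to mutations in the leader-elected storage, while node_for is a purely local computation.

```python
import bisect
import hashlib

def _h(s: str) -> int:
    return int(hashlib.sha256(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes: list[str]) -> None:
        self._points = sorted((_h(n), n) for n in nodes)

    def add_node(self, node: str) -> None:       # one mutation in the leader-elected store
        bisect.insort(self._points, (_h(node), node))

    def remove_node(self, node: str) -> None:    # one mutation in the leader-elected store
        self._points.remove((_h(node), node))

    def node_for(self, key: str) -> str:         # purely local; consensus is never consulted
        idx = bisect.bisect(self._points, (_h(key), "")) % len(self._points)
        return self._points[idx][1]
```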
The good news is, once you rewire your brain to think in terms of workflow orchestration, the parts about vector clocks and leader election will be of academic interest only to you.
Nonetheless, I believe it is important to talk about leader election and vector clocks in sufficient depth, if only to encourage the reader to further their investment into understanding stateful orchestration. Ideally, one internalizes the concepts of leader elections, vector clocks, and distributed consensus so deeply, that their thought process operates on the level of orchestration and above.
We do not, after all, think of the CPUs and the OS kernel scheduling processes and [green] threads while using the async/await paradigm.
In fact, many engineers are skilled professionals when it comes to using async/await effectively, while they would struggle to articulate how this concept operates behind the scenes, let alone implement its execution layer from scratch. Divide and conquer is fundamentally crucial in designing languages and frameworks; my job here is to do the same for stateful orchestration.
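As a tiny illustration of that point, here is what the developer-facing side of async/await looks like; a hypothetical sketch, with no event loop internals or kernel schedulers anywhere in sight.

```python
import asyncio

async def fetch_price(symbol: str) -> float:
    await asyncio.sleep(0.1)  # stand-in for a network call
    return 42.0

async def main() -> None:
    # Two concurrent "requests"; the scheduling machinery underneath stays invisible.
    prices = await asyncio.gather(fetch_price("AAPL"), fetch_price("MSFT"))
    print(prices)

asyncio.run(main())
```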
On to Part Two
This first, academic post has covered:
Two-phase commits and the saga pattern,
Vector clocks, and
Leader-elected, totally ordered storage.
Each successive level enables us to write easier-to-follow and more maintainable code, which is exactly what the industry needs today.
But even in the unrealistic scenario where every single developer in the company has fully internalized all of the above, the code they produce is still not as neat as it can be. The time has come to make the next step.
This next step has to do with learning to think in terms of the orchestration engine. And I am covering it in detail in the very next post.