It was no later than 2008 that Amazon Web Services set the standard with the “eleven nines” concept of data durability.
According to my quick research, the 2006 S3 paper cited 99.99% durability. The official “eleven nines” announcement took place in 2010. The actual figure appeared in the documentation some time in 2007 or 2008. So, “no later than 2008” fits my journalistic fact-check just fine. But I’m happy to correct the post if it undersells what must not be undersold — after all, durability is an extremely important topic, and we must give credit where credit is due.
The sales pitch from AWS is that if you store a billion objects on S3, you can expect to lose approximately one of them every hundred years.
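For a sanity check, here is my own back-of-the-envelope arithmetic behind that pitch, assuming the standard per-object, per-year reading of the durability figure:

```python
# Back-of-the-envelope check: eleven nines of *annual* durability
# means a ~1e-11 chance of losing any given object in a year,
# assuming independent per-object losses.
annual_durability = 0.99999999999            # eleven nines
p_loss_per_object = 1.0 - annual_durability  # ~1e-11

objects_stored = 1_000_000_000               # one billion objects
expected_losses_per_year = objects_stored * p_loss_per_object

print(expected_losses_per_year)              # ~0.01 objects per year
print(1.0 / expected_losses_per_year)        # ~100 years per lost object
```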
Eleven nines sounds extremely robust.
At the same time, it’s just … not strong enough for quite a few use cases.
Allow me to elaborate, since this is a critical concept with several facets. We will need to agree on some terminology first, and then explore the problem domain from various angles.
Storage is Simple
First, one should not confuse storage durability with compute durability. Surprisingly, the latter term does not properly exist yet, so let’s introduce it.
Storing data is relatively cheap. And the nines pile up quickly as redundancy is increased.
Addendum: Friends suggested that the subtitle of “Storage is Simple” is a bit too bold. So let me phrase it better.
Storage for the sake of storage is a relatively simple problem. As in, storage is straightforward as long as the data is immutable: write-once, overwrite-never. Storage is also manageable as long as consistency requirements are low: last-write-wins, and it’s okay for clients requesting the data from different parts of the system to see a different “winner”.
That’s when “storage for the sake of storage” is, well, as simple as one person’s CS Ph.D. thesis.
However, the focus of this post is consistency at scale and durability. Seen from this angle, the storage part itself is indeed relatively simple — since everything around the very storage part is far more complex.
Making copies of the entire data blob is inefficient at best, but with erasure coding and cross-region replication, many, many nines are not rocket science.
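To make “nines pile up quickly” concrete, here is a toy calculation of mine, under the generous assumption that replica failures are fully independent:

```python
import math

# Toy model: an object is lost only if ALL replicas are lost,
# assuming independent failures (the big, often-false assumption
# that real systems pay dearly to approximate).
P_ANNUAL_LOSS_ONE_COPY = 1e-4  # hypothetical per-copy annual loss odds

def durability_nines(replicas: int) -> float:
    p_all_copies_lost = P_ANNUAL_LOSS_ONE_COPY ** replicas
    return -math.log10(p_all_copies_lost)

for r in (1, 2, 3, 4):
    print(f"{r} replica(s): {durability_nines(r):.0f} nines")
# 1 replica(s): 4 nines ... 3 replica(s): 12 nines. They pile up fast.
```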
At risk of stepping on a slippery slope, one could argue that the old-school Torrent network has far higher storage durability than what S3 could possibly offer! Because when some file is seeded by thousands of nodes, it would take a planetary-level disaster to have that file be lost for good.
Thinking in the SysDesign principles of cloud computing, the “overhead” of this Torrent durability is huge, since the data is duplicated — copied over in full! — far beyond any reasonable number of replicas. But the claim of the Torrent network’s outstanding storage durability is absolutely defensible.
Computation is Hard
At the same time, ensuring the durability of computation is a totally different story. Especially when different logically parallel computation processes can affect each other.
When contention is possible, in order to process requests, the system must inherently agree within itself on which operations happened before which others.
Sure, many, if not most, operations do not conflict with one another, and are easy to run in parallel. But when they conflict, they do conflict, and that’s exactly where hard-to-catch problems are lurking.
This is the dreaded total vs. partial order problem, and this is exactly why Kafka publishes peak throughput numbers for massively parallel multi-partition use cases of pushing large messages.
Ordering a large number of small messages that should all make it into the same logical partition is effectively the “failure mode” use case of most message-sequencing solutions. Constructing Total Order is indeed the Achilles’ heel of partition-based solutions, and this is why Kafka is very open about its performance when partitioning can be leveraged effectively.
Using Kafka is simple when the data naturally splits into universes that do not intersect. Re-introducing synchronization when those intersection points inevitably emerge is where we get the whole spectrum of ugliness, starting from SAGAs and compensating transactions, and ending with Dead Letter Queues, which are used and abused in SysDesign interviews, but are hardly a solution for large-scale production systems.
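A minimal sketch of this tradeoff, using generic hash-mod routing rather than Kafka’s actual partitioner:

```python
# Messages with the same key land in the same partition, so they stay
# ordered relative to each other; messages in different partitions
# have no mutual order at all.
NUM_PARTITIONS = 4

def route(key: str) -> int:
    return hash(key) % NUM_PARTITIONS  # Python's salted hash; fine for a demo

partitions: dict[int, list[str]] = {p: [] for p in range(NUM_PARTITIONS)}
for key, payload in [("alice", "m1"), ("bob", "m2"), ("alice", "m3")]:
    partitions[route(key)].append(payload)

# Within one partition the messages form a total order ("m1" before "m3").
# Across partitions there is no order at all; reconstructing a global
# one is exactly where the hard-to-catch problems lurk.
print(partitions)
```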
Storage Metadata is Computation
Storage durability breaks down into data persistence durability and metadata durability.
The Torrent example is very powerful, but also quite straightforward — because the problem statement of “restoring file contents by its hash” pretty much guarantees that the contents for the requested hash will never change.
Granted, hash collisions can happen, and there are means to have fewer of them, but the argument holds — a Torrent-based solution has zero contention resolution logic built into itself. And the hash collision problem can be trivially mitigated by adding more bits of entropy one way or the other.
Now, S3 these days is strongly consistent with respect to metadata.
This means, in particular, that the user can logically execute what is called an If-Unmodified-Since file mutation request. As in, a request that should only succeed if the mutation is happening to the version of the object this user has just observed. And if the object has changed between the last time it was observed and the mutation request, the mutation request will be rejected.
Simply put, say a given file is at journaled version 100, and multiple clients concurrently attempt to replace it by the next version, 101. Even if a million clients send the respective request concurrently — with different contents of the file they would love the world to see as the 101st version of it — only one of them will succeed.
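In code, the guarantee boils down to roughly the following. This is a single-process stand-in of mine, not S3’s implementation, for what a distributed system must uphold across many machines:

```python
import threading

class VersionedObject:
    """Toy single-key If-Unmodified-Since semantics; all names are mine."""

    def __init__(self, value: bytes, version: int = 100):
        self._lock = threading.Lock()
        self._version = version
        self._value = value

    def put_if_version(self, expected_version: int, new_value: bytes) -> bool:
        with self._lock:
            if self._version != expected_version:
                return False      # somebody else won the race: re-read, retry
            self._value = new_value
            self._version += 1    # this caller's write becomes version 101
            return True

# A million concurrent put_if_version(100, ...) calls: exactly one
# returns True; all others observe False, no matter the interleaving.
```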
Computation Consistency is Very Hard
Much as I want to refrain from Web3 analogies, the “Alice sending money to Bob” example is the best one here.
Imagine Alice has $150 on her account, and she wants to send $100 to Bob. In the adversarial scenario — as in, when we assume the worst case scenario when designing the algorithm — Alice can send a million requests to send $100 to Bob, or to a million different Bobs. Different requests will come from different parts of the world, and they will arrive via different channels. Continuing with the money analogy, some requests may well be Alice writing down paper checks, and some others will be her credit card company and mortgage company and utilities company attempting to legitimately withdraw her funds at the same time.
A well-designed system will make sure Alice’s account will never go below zero. Some request(s) will “win”, as in, they will be declared by the system to have “arrived first”. Other requests will be logically processed after Alice’s account was drained, and, of course, they will correctly be rejected.
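A hedged sketch of that “well-designed system”, with every name invented for illustration; the essence is the same compare-and-swap primitive, applied in a retry loop:

```python
import threading

class Store:
    """Minimal in-memory stand-in for a CAS-capable key-value service."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data: dict[str, tuple[int, int]] = {}  # key -> (value, version)

    def read(self, key: str) -> tuple[int, int]:
        with self._lock:
            return self._data.setdefault(key, (0, 0))

    def compare_and_swap(self, key: str, expected_version: int,
                         new_value: int) -> bool:
        with self._lock:
            _, version = self._data.setdefault(key, (0, 0))
            if version != expected_version:
                return False
            self._data[key] = (new_value, version + 1)
            return True

def debit(store: Store, account: str, amount: int, retries: int = 10) -> bool:
    for _ in range(retries):
        balance, version = store.read(account)
        if balance < amount:
            return False  # would go below zero: correctly rejected
        if store.compare_and_swap(account, version, balance - amount):
            return True   # this request was serialized "first"
    return False          # kept losing the race; the caller may retry later

store = Store()
store.compare_and_swap("alice", 0, 150)  # seed Alice's $150
print(debit(store, "alice", 100))        # True: exactly one withdrawal wins
print(debit(store, "alice", 100))        # False: $50 < $100
```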
It should be noted here that while supporting If-Unmodified-Since is a major milestone along the way to distributed consensus at scale, it is by no means sufficient as the universal building block.
For the design to be legitimately universal, it should allow transactionality over multiple data objects. And the IDs of these extra data objects may well only become available after some other data objects were accessed — as in, the value for some “primary key” in one “table” may well reveal more “primary keys” in other “tables”.
Google Spanner, with its SQL-like language to execute transactions, offers far, far more than AWS S3 by introducing external consistency. But it is a) prohibitively expensive, and b) requires custom hardware. More on this later.
The point is that even with one account to manage — as in, no cross-key consistency in sight — the problem of maintaining metadata correctness, so that each object has its own strongly consistent lineage, is a very hard problem.
If you’re struggling to grasp this, just observe how it quickly reduces to the Exactly-Once Problem, where each command must be executed exactly once. Exactly-once is hard. Metadata consistency is at least as hard.
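If the reduction feels abstract, here is the standard shape of the workaround: at-least-once delivery plus an idempotent apply, which merely relocates the hard part into keeping the dedup state itself durable and consistent. The sketch and its names are mine:

```python
# "Exactly-once effect" = at-least-once delivery + idempotent apply.
# Each command carries a unique ID; the state machine remembers which
# IDs it has already applied. In a real system, this dedup table itself
# must be durable and consistent: the same problem, all over again.
applied_ids: set[str] = set()
state = {"counter": 0}

def apply_once(command_id: str, delta: int) -> None:
    if command_id in applied_ids:
        return  # duplicate delivery: ignore
    state["counter"] += delta
    applied_ids.add(command_id)

apply_once("cmd-42", 5)
apply_once("cmd-42", 5)  # retried by the network: no double-execution
print(state["counter"])  # 5
```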
Putting it All Together
The uniqueness-of-execution, “computation durability consistency” guarantee is far, far more difficult to implement than the “eleven nines” one.
The “eleven nines” persistence durability guarantee is easy to reason about, to estimate, and to verify.
When it comes to metadata durability guarantees, in the world where the state does change over time, even defining the functional requirements is quite difficult.
After all, what would some “eleven nines” even mean for the above example of attempting to concurrently perform an operation such that only one request of many should succeed?
SQL-like consistency is hard to even reason about. Even “If-Unmodified-Since” is hard.
Logically, for the purposes of the argument, let’s focus on consistently executing one trivial operation: the distributed atomic swap.
Simply put, we assume each version of each stored object has a large unique hash. Several hash-collision stories are well known already, so let’s think of this hash as a uint128 to avoid any and all trouble.
The “minimalistic”, pun intended, problem statement can be formalized as:
We can only mutate a single object at a time.
No batches, bulks, or multi-key operations.
The semantics of the mutation request will always be “swap A for B”.
Where A is the previously observed version of the object with its own uint128 hash value.
The mutation request should only succeed if the object’s state is still “A”.
The mutation request should fail if the object has changed since.
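Spelled out as an interface, the whole problem statement above fits in one method. A single-node sketch under those rules (making this one method behave identically across a fleet is the entire problem):

```python
import hashlib
import threading

def h128(value: bytes) -> int:
    """Stand-in for the per-version uint128 hash (truncated SHA-256)."""
    return int.from_bytes(hashlib.sha256(value).digest()[:16], "big")

class SwapStore:
    """One key, one operation: swap A for B."""

    def __init__(self, initial: bytes):
        self._lock = threading.Lock()
        self._value = initial

    def swap(self, expected_hash: int, new_value: bytes) -> bool:
        with self._lock:
            if h128(self._value) != expected_hash:
                return False  # the object changed since it was observed
            self._value = new_value
            return True       # and now it MUST stay swapped, forever

store = SwapStore(b"A")
assert store.swap(h128(b"A"), b"B")      # succeeds: state was still "A"
assert not store.swap(h128(b"A"), b"C")  # fails: state is "B" now
```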
Distributed consensus means that all clients interfacing with the system see the same data.
Clearly, since the speed of light is finite, realistically, all “view-mode” clients are eventually consistent.
The only guarantee we request to hold strongly is that:
IF the sender of the mutation request has received the confirmation that this request to replace A with B has succeeded,
THEN it should be the case that, at some point in the history of other clients reading this object, it will have changed from A to B.
It’s okay for the object to be changed from B to C later on.
There is no explicit requirement that every client will see every version.
The only thing that is strictly not okay (“MUST NOT” in RFC terms), is for the system to declare a certain mutation request to be successfully applied, to only later act as if this request did not exist.
An equally strong requirement is that the system should always accept requests that can be executed.
As in, the system SHOULD NOT refuse a request to change the state of the object from A to B when the “current” state of the object is indeed A.
Realistically, of course, most read requests are allowed to be behind (“eventually consistent”). In practice, the read API contract will probably include a version number for the data under each key. So that a request of the form “return me the object at key K, version V, hash H” is fully idempotent.
And requests for the "latest" version are generally allowed to be inconsistent, since the client may well be hitting different read replicas, one of which is slightly behind the other.
If we were to be picky, we would say that there should be a way to run a blocking request of the type “return the object at key K, version V, and wait until your read replica has it, or until timeout T has passed”.
The client is then expected to only send such requests once it knows for sure that version V does exist for key K.
The response for such a detailed request may well be:
“Here it is, the value for key K at version V, with hash H”, or
“Timeout, this eventually consistent read replica was not ready within T”, or even
“The version for K that this read replica has is V2, which is too far ahead of V, and according to retention policy version V is no longer available — please hit the cold storage if you truly need this ancient version V”.
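If we are being picky about the contract, these three outcomes map naturally onto a tagged result type. A sketch with names of my own choosing:

```python
from dataclasses import dataclass

# The three outcomes of "give me key K at version V, wait up to T".

@dataclass
class Value:          # the replica has it (or caught up within T)
    key: str
    version: int
    hash128: int
    payload: bytes

@dataclass
class Timeout:        # eventually consistent replica not ready within T
    waited_ms: int

@dataclass
class Evicted:        # replica is at V2 >> V; V fell off the retention window
    available_version: int
    hint: str = "hit the cold storage for ancient versions"

ReadResult = Value | Timeout | Evicted
```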
Ironically, this section was meant to be the shortest one — after all, how long should it take to describe the API contract in strict terms? Well, apparently, strong consistency is hard even when it comes to merely understanding what is there to build.
Quantifying the Acceptance Criteria
Now, how do we even measure the reliability of such a system in “nines”?
The point is, as long as the users are relying on the above RFC, it is just not acceptable for the system to behave inconsistently. Eleven nines will not be enough. If the distributed system has broken the above guarantee, it’s just not a good system, for all intents and purposes.
To approach the durability problem, we can talk about:
The “read uptime” vs. the “write uptime”.
As in, there could be partial outage modes, when the system is guaranteed to return consistent data along the read path, while the write path is degraded.
The “expected mutation throughput”.
So that when in partial degradation mode the system is allowed to return “TOO_BUSY” for certain mutation requests as long as the more critical ones do go through.
The “partition tolerance threshold”.
To quantify how many datacenters or availability zones should be up and running with full connectivity for the guarantees to hold. Needless to say, if the partition factor is not enough, the system should still keep its data in the strongly consistent state — the “write throughput” will have to be throttled and QoS-ed significantly.
The cost of maintaining the guarantees in a world where failures are imminent.
Realistically, outages of various magnitudes happen, and some risk mitigation techniques are more costly than others. At scale, the cost vs. throughput tradeoff becomes critical.
Simply put, for certain critical operations, even the throughput of good old SMS messages may well be enough — perhaps as the last resort to troubleshoot the otherwise-inaccessible datacenter.
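One way to pin these down is to write the acceptance criteria as an explicit structure. The fields below are hypothetical, but they are roughly what such a contract would have to quantify:

```python
from dataclasses import dataclass

@dataclass
class ConsistencySLO:
    # Uptime is split: reads can stay consistent while writes degrade.
    read_uptime_nines: float        # e.g. 5.0 means 99.999%
    write_uptime_nines: float       # typically lower than reads
    # Under degradation, guaranteed throughput for critical mutations;
    # everything else may receive TOO_BUSY.
    min_mutations_per_sec: int
    # How many AZs/datacenters must be healthy and connected for the
    # full guarantees to hold; below this, writes are throttled,
    # but the data must remain strongly consistent.
    min_healthy_zones: int
    # Budget: risk mitigation trades cost against throughput at scale.
    max_cost_per_million_ops_usd: float
```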
The main point stands tall though: in the world where strong consistency of computation is a must, eleven nines are nowhere near enough.
A good analogy here is the C-style “undefined behavior” problem, where a particularly nasty failure mode can drive a customer’s service into a cascading failure, triggering anything from data loss to private information leaking where it should never have been exposed.
Once the product cannot afford such errors, the “eleven nines” data persistence durability guarantee is just not good enough. And if the user must introduce their own mitigation strategies, it’s always extra cost, extra development time, and, most importantly, increased risk of subtle bugs down the road. The provider of the system either does or does not provide strong consistency guarantees.
There are no “nines” when it comes to this strong consistency guarantee. If the user can not rely on strong consistency, the user and the user alone is responsible for implementing their business logic such that a failure in consistency can be self-healed on the user’s side.
I cannot stress this enough: what we call “undefined behavior” is extremely bad. It’s unpredictably, unboundedly bad.
If the user code assumes that whatever is guaranteed to be final is indeed final, this user code will be in for a rude awakening the one time this guarantee is broken. It’s the domain where fixing the problem after it has occurred may well cost millions of times more than preventing the problem in the first place.
Writing code that provides this self-healing is an order of magnitude more difficult than writing code that offloads the problem of consistency onto the storage and/or storage + computation engine. This is exactly why consistency-first durable storage and storage + computation engines are so valuable.
Consistent Reads are Strongly Consistent in Price
Also, an important comment should be made here: strongly consistent reads are almost as expensive as strongly consistent writes.
Read requests that are allowed to return stale data can and will hit the replicas — we can talk about the number of nines when it comes to making sure this data is at most this many milliseconds old.
But read requests that should return the state of the data that is guaranteed to be the “presently most recent” state should effectively “earn their place” in the queue of requests that can mutate the respective key. Which is exactly where the problem of high contention emerges, and which is exactly why consistent read requests should be treated as if they could have modified the data by logically holding an exclusive lock to it — albeit for a negligibly short amount of time.
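To make the “exclusive lock, albeit briefly” point concrete, here is a toy single-node rendition in which a strongly consistent read goes through the very same serialization point as a write:

```python
import threading

class Key:
    """One key with a single serialization point for reads and writes."""

    def __init__(self, value: bytes):
        self._lock = threading.Lock()
        self._version = 0
        self._value = value

    def write(self, new_value: bytes) -> int:
        with self._lock:
            self._version += 1
            self._value = new_value
            return self._version

    def consistent_read(self) -> tuple[int, bytes]:
        # Pays the same toll as a write: it must briefly hold the key's
        # serialization point to be "presently most recent"; in effect,
        # a write that mutates the key into its own current value.
        with self._lock:
            return self._version, self._value
```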
And here lies the question: if the user code does rely on a successful response meaning the request has truly succeeded, is the system allowed to later fail with an exception when it turns out that this very “eleven nines” guarantee was broken for this particular request?
Massively Distributed Solutions
This post will be incomplete without mentioning solutions that are based on massive decentralization.
We mentioned one such solution for storage — the Torrent protocol.
When it comes to storage plus compute, solutions exist as well — the Blockchain family of protocols.
Since its inception, the Ethereum network has processed over one billion transactions on the mainnet alone. When taking side chains into account, the number is substantially higher. And, unlike S3, which explicitly allows itself to lose the occasional object, the Ethereum ledger guarantees that no past transaction data will ever be lost.
Sure, one day the community may decide that some old transactions can be batched or squashed on the main chain. So that OLAP-grade users who are interested in “archeologically” researching old transactions will need to find ancient indexers, and likely pay extra to retrieve this data. But something tells me that even if this were to happen, the Ethereum community would make sure enough Merkle hashes remain on-chain, such that this “ancient” data, upon being retrieved and signature-verified, is “uint128-level” guaranteed to be correct.
And a uint128-level guarantee, for the record, is closer to “forty nines” than to “eleven nines”: the odds of a uint128 collision are 2^-128 ≈ 10^-38.5. It’s literally “eleven nines of eleven nines of eleven nines”, plus a bit more. We should not be counting the number of years in which we shall expect to see one error. We should be counting the number of universes in which we can plausibly expect this one error to show up with any non-negligible probability!
Bitcoin uses the proof-of-work approach, where CPU utilization per block is enormous and where settlements are only probabilistic — albeit with many, many nines after just a few blocks. This approach has its merits for Web3 purposes, and Bitcoin enthusiasts love it, for all the good reasons. But we are not talking about Web3 here.
Suffice it to say that Bitcoin only offers probabilistic finality. As in, the probability that some transaction will be “declared void” is never zero, although it does tend to zero quickly. Ethereum’s proof-of-stake, on the other hand, offers finality by design.
Finality simply states that there exists a network-wide consensus on the state of the system. So that once a transaction is finalized, no matter what happens in the future, this currently-finalized state will forever be part of observable and traversable history.
And, unlike If-Unmodified-Since and Compare-and-Swap, which only operate on a single logical key, the Ethereum Virtual Machine (EVM) can run arbitrary code as part of its transactions. Sure, these transactions — they will be EVM smart contracts — are prohibitively expensive and prohibitively slow when compared to more standard SysDesign solutions such as, well, Google Spanner.
The Web3 community is making major progress when it comes to improving durability while reducing costs and settlement/finality times. Some top-tier networks, such as Solana, Stellar, and Polkadot, as well as dedicated networks such as the Internet Computer Protocol, commit in their whitepapers to striving for highly durable computation that supports arbitrarily complex requests.
However, when massive decentralization comes first, it is just very hard, or even impossible, to design a system that is at the same time general-purpose, high in storage and compute capacity, and by design infinitely scalable as more volunteers add more nodes to it. Adversarial incentives and zero-trust environments have inherent limits to what they can do quickly and cheaply.
So, I do not expect massively decentralized real-world products to gain widespread adoption from non-Web3 communities. When it comes to engineering distributed systems, there will be no Great Leap Forward in which a huge number of privately owned nodes provide a general-purpose, low-latency, high-durability storage plus compute engine.
Targeted vertical solutions will definitely exist — I am working on one as we speak, after all! — but when it comes to general-purpose strongly consistent data storage and data mutations, the fully decentralized zero-trust approach just can’t compete with full-scale cloud providers. Their business model aligns better and better with providing strongly consistent solutions, and the tech does get better over the years.
Closing Thoughts
Nonetheless, the worlds of “complete decentralization” and cloud-first cross-region redundancy and durability are converging. More and more, we hear the talk of consistency guarantees that:
Satisfy the “eleven nines” data persistence guarantees, AND
Satisfy compute finality guarantees that are at least as strong as what Ethereum offers.
For a powerful yet illustrative example, imagine if Starlink wants to launch its own competitor to both Google Spanner and Ethereum at the same time.
Starlink’s hardware ecosystem is most definitely closed, as in, there does exist a centralized authority which keeps inventory of all the devices presently on the network. This makes a Starlink-wide solution distinctively different from a Web3 solution.
After all, a Web3 solution cannot be declared sufficiently trustworthy unless community members can add their own nodes — to share in the profits of operating the entire fleet and, no less importantly, to “keep each other honest”, to the degree that the “original” nodes maintain no privileges or special powers over the “newcomers”.
Knowing a thing or two about how Elon Musk conducts his business, I doubt he would be interested in leveling the playing field first, effectively opening himself up to more competition in a space where he is perfectly positioned to enjoy his monopoly while it lasts.
At the same time, Starlink satellites are far closer to blockchain nodes than they are to datacenters and availability zones. This is simply due to the fact that while orbiting the Earth, the relative positions of satellites vary quite a bit.
Generally speaking, for a satellite-based “Spanner”, there is no such thing as availability zones.
First, each satellite is fully autonomous, so it is its own availability zone.
More importantly though, the very idea of proximity makes no sense, since the “blast radius”, quite literally speaking, will affect a random subset of satellites — in space, there is no such thing as some “East coast tornado” or “West coast supervolcano” or “Southeast Asia political instability”.
Granted, the Starlink example is a bit too grandiose and a bit too extreme.
Nonetheless, I think it is an important thought experiment when it comes to designing resilient, durable, strongly consistent distributed systems.
To end on a good note: Eleven Nines are dead, long live Eleven Nines.
As an industry, we have accumulated a wealth of knowledge, both theoretical and practical.
Time to build better primitives for the brave new Strongly Consistent, Durable world!