Building High-Throughput Consensus
Producing Total Order by blending Raft, Web3, and secure enclaves.
Two Flavors of Consensus
Consensus protocols tend to come in two flavors:
Classical quorum-based (Paxos, Raft, Zab, etc.) → Strong safety guarantees, but relatively low throughput when the nodes are geo-distributed and round-trip times are high.
Blockchain-inspired (Nakamoto, BFT, proof-of-stake, etc.) → High throughput and decentralization, but with probabilistic finality or heavier Byzantine assumptions.
In this post, I take the best of both worlds and construct a system where low-latency quorum consensus drives globally ordered, high-throughput transaction processing — all while keeping the protocol simple, verifiable, and self-healing.
As a disclaimer, my take was and remains that most products do not need to build their own consensus protocols. It's a very difficult problem where "just works" is not enough. Mistakes are both hard to isolate and extremely costly. Nonetheless, it has come time for me to tackle this challenge, since the blend of product and tech ideas I’m exploring now can benefit immediately from a novel protocol.
Key Concepts at a Glance
As a teaser, here are the key concepts this post will touch on. In the interest of length, I won’t explain them in full detail, but will introduce them casually along the way. By the end, you will have a good sense of the nature, behavior, and value of:
Source of truth, contention, consensus, and strong vs. eventual consistency.
Immutability, append-only data structures, and CRDTs (conflict-free replicated data types).
The gossip protocol / epidemic algorithms / state dissemination.
Single-threaded execution of totally-ordered transactions.
Non-Byzantine consensus via a few on-prem nodes, with possible Byzantine governance.
Finality, execution provability, attestation via secure enclaves.
Please forgive the evangelist tone — a dry design doc wouldn’t make a good blog post, and vice versa. When these ideas (or a large subset) crystallize into a design doc, I’ll share it here too.
Consistency and Finality
Most large-scale distributed systems eventually choke on contention: deciding which of two events must be handled first.
Some systems don’t care — for example, large-scale video processing. Requests are independent, order doesn’t matter, and duplicates only waste resources, not correctness. But in systems with dependencies — like transactions — order is everything.
Take Alice sending tokens to Bob, then Bob sending tokens to Charlie. Bob can’t pay Charlie until Alice’s transfer lands. This looks obvious to a user, but for an architect it exposes the gap between user-perceived order and system order.
If Bob’s device shows “Alice → Bob” completed, he assumes “Bob → Charlie” must succeed. But different parts of the system may handle these events, and “after” in distributed systems only means “after the message was sent through this channel,” not “after it has fully settled.”
That’s where consistency and finality come in. Strong consistency ensures Bob’s notification implies the transfer really is settled. Finality means the transfer can’t be rolled back. Without these guarantees, “after” to a human may not mean “after” to the system.
This is the heart of contention in distributed systems — and why ordering, consistency, and finality matter so much.
For fault tolerance, systems must be distributed — and in a distributed system, no single server can claim to hold the “true state.” A server may have a recent view of Bob’s balance, but it’s generally incomplete and potentially stale.
The real “state” only emerges from consensus across multiple nodes, and only at the moment a request is processed.
That’s why strongly consistent reads are as costly as writes: they must follow the same contention path as mutations. By contrast, eventually consistent reads can be served cheaply from replicas, since users accept slightly stale results.
A good way to reason about distributed systems is in terms of ledgers and immutability. Here “ledger” doesn’t mean blockchain — any append-only journal that preserves invariants qualifies.
Bob’s balance isn’t immutable, of course, but the cleanest way to model it is as a derived property of an immutable transaction log. The “current balance” is essentially a cached value, which can be stale — especially on read replicas.
This is why distributed systems rarely give strict wall-time guarantees. If Bob reconnects to a different server, its cache may lag, so refreshing a page doesn’t ensure he sees the latest transfer.
To reduce complexity, we treat balances as derived and the ledger as the source of truth. In blockchains, this ledger is grouped into blocks; elsewhere, batches play the same role. Adding hashes or checksums, which most databases already do internally, prevents the system from “cheating itself.”
When a system offers consistency and finality, each new block (or batch) becomes part of a single, immutable chain. This lets cheap, scalable reads ask for results “at” or “after” a given block index — knowing the ledger won’t change retroactively.
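To make this concrete, here is a minimal Python sketch of the “balance is a derived view over an immutable log” idea; the names (Transfer, balance, as_of_block) are illustrative rather than any real API:

    from dataclasses import dataclass

    @dataclass(frozen=True)              # transfers are immutable once appended
    class Transfer:
        block: int                       # index of the block/batch that settled it
        sender: str
        receiver: str
        amount: int

    LEDGER: list[Transfer] = []          # the append-only source of truth

    def append_transfer(block: int, sender: str, receiver: str, amount: int) -> None:
        # The only mutation the system ever performs: appending to the log.
        LEDGER.append(Transfer(block, sender, receiver, amount))

    def balance(account: str, as_of_block: int) -> int:
        # The "current balance" is a derived, cacheable view over the immutable log,
        # computed at (or after) a given block index.
        total = 0
        for t in LEDGER:                 # the ledger is ordered by block index
            if t.block > as_of_block:
                break
            if t.receiver == account:
                total += t.amount
            if t.sender == account:
                total -= t.amount
        return total

    append_transfer(1, "alice", "bob", 10)    # Alice -> Bob
    append_transfer(2, "bob", "charlie", 7)   # Bob -> Charlie, settles only after block 1
    assert balance("bob", as_of_block=1) == 10
    assert balance("bob", as_of_block=2) == 3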
Problem Statement: Total Order with Low Latency and Durability
The previous part was the minimum viable intro. Now, before I present the architecture for the solution, let me outline the problem statement in unambiguous terms.
What we want to accomplish is the total order of events. Eventually consistent reads can hit read replicas and be behind by a small time lag — generally a few seconds max. Mutations (writes and strongly consistent reads) must be totally ordered: for any two requests, the system must know which one logically arrived first.
In other words, at the heart of the system, any potentially mutating request first and foremost gets a sequence number attached to it. Sequence numbers increase monotonically and have no gaps. This is also known as total order broadcast.
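As a toy illustration of what total order broadcast boils down to at the stamping point, here is a hedged Python sketch; a real sequencer is distributed and replicated, whereas this one is a single in-process counter with hypothetical names:

    import itertools
    import threading

    class Sequencer:
        # Toy sequencer: stamps every mutating request with a gapless, monotonically
        # increasing sequence number. Real total order broadcast also replicates and
        # disseminates the stamped log; this sketch covers only the stamping point.
        def __init__(self) -> None:
            self._counter = itertools.count(1)    # 1, 2, 3, ... with no gaps
            self._lock = threading.Lock()
            self.log: list[tuple[int, object]] = []

        def submit(self, request: object) -> int:
            with self._lock:    # a single stamping point yields a single total order
                seq = next(self._counter)
                self.log.append((seq, request))
            return seq

    s = Sequencer()
    assert s.submit({"op": "transfer", "from": "alice", "to": "bob"}) == 1
    assert s.submit({"op": "transfer", "from": "bob", "to": "charlie"}) == 2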
Now, for consensus, we are not aiming for planet-wide distribution at the heart of ordering. The priority is:
Achieve maximum possible throughput while maintaining a total order for mutating requests.
Keep latency as low as possible.
Maintain blockchain-grade durability, where eleven nines are not enough.
Consensus across distant regions requires at least one round trip, which isn’t low latency. So, to be truly low latency, the heart of computation should be geo-localized. This doesn’t have to be one server, one rack, or even one data center. A few availability zones in the same region do the job just fine — even double-digit-millisecond pings are not something users notice.
That is, as long as durability is guaranteed elsewhere — which it is. All transactional data is replicated around the globe, so even if this particular geo-location becomes unavailable all at once, users won’t experience more than a few seconds of disturbance.
The novelty is keeping transaction settlement times low while the extra round trips (potentially a few hundred extra milliseconds) are taken out of the finality time budget. Taking the orderbook as the lowest-latency example, users still benefit from sending their transactions earlier, to gain priority in execution — even though the confirmation of order execution may arrive a few dozen to a few hundred milliseconds later.
Moreover, by design, we want the user-provided compute — e.g., trading orders, continuing the orderbook example — to happen on our platform. This ensures zero effective latency between customer code and the core engine, which ultimately decides which request came first in case of contention.
Non-Byzantine Assumptions, Path to Byzantine Hardening
Also, the protocol is not Byzantine by design. It’s originally intended for on-prem deployments where the company controls its hardware and users trust the company.
However, I firmly believe this line is being blurred:
Low latency — i.e., high-frequency transactions — still requires geo-locality, so physical proximity of servers is inevitable.
Hashing internal ledgers and publishing proofs is becoming more prevalent, even for non-Web3 companies.
Modern staking and slashing techniques ensure nodes (and owners) have substantial “skin in the game,” deterring cheating.
Deterrence ties to external signature validation and finality — which, with proper design, can be under one second behind wall time.
Secure enclaves are quickly entering the Web3 and near-Web3 space.
Without going into much detail (this post is about the consensus algorithm), the sketch for a high-level Byzantine design would be:
The company owns its hardware in a few availability zones near each other.
This hardware executes the business logic of high-throughput, high-contention transactions in secure enclaves.
For extra durability, Q secure-enclave vendors are used, and the consensus should include at least (Q−1) of them.
Consensus is not yet finality, but a consensus-provided Merkle hash is then validated by many entities around the globe.
Finality is thus reached well within one second, as external verifiers confirm that signatures from (Q−1) enclaves are correct.
There are extra steps not covered here: consensus must be validated only after data is replicated to several locations globally; a mismatching signature should trigger slashing that rewards the detecting verifier; and with fast verification and finality, the total “funds at risk” at any point stays well below the slashed stakes in case of errors or malice.
While the above doesn’t have to run in a purely Web3 setting, it’s easy to do so. All that’s needed is an additional Byzantine-proof governance layer. Governance includes the set of active nodes, their enclave vendors, the corresponding public keys, and slashing rules — potentially controlled by a council vote, a DAO/digital democracy, or another governing entity, with forkability as the ultimate check. I refer to this outer, slower, Byzantine consensus layer as the Control Plane Chain.
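To sketch what external verification could look like, here is a simplified Python example that builds a Merkle root over an ordered batch and checks that at least (Q−1) of Q enclaves signed it. The HMAC-based signatures and vendor names are stand-ins for vendor-specific attestation, purely for illustration:

    import hashlib
    import hmac

    def merkle_root(leaves: list[bytes]) -> bytes:
        # Simple binary Merkle tree over the ordered transaction batch.
        level = [hashlib.sha256(x).digest() for x in leaves] or [hashlib.sha256(b"").digest()]
        while len(level) > 1:
            if len(level) % 2:
                level.append(level[-1])           # duplicate the last node on odd levels
            level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                     for i in range(0, len(level), 2)]
        return level[0]

    # Stand-in for vendor-specific enclave attestation: each "enclave" signs the root.
    ENCLAVE_KEYS = {"vendor_a": b"ka", "vendor_b": b"kb", "vendor_c": b"kc"}   # Q = 3

    def enclave_sign(vendor: str, root: bytes) -> bytes:
        return hmac.new(ENCLAVE_KEYS[vendor], root, hashlib.sha256).digest()

    def externally_verified(root: bytes, signatures: dict[str, bytes], q: int) -> bool:
        # Finality check run by outside verifiers: at least (Q - 1) enclaves
        # must have signed the same consensus-provided Merkle root.
        good = sum(1 for vendor, sig in signatures.items()
                   if hmac.compare_digest(sig, enclave_sign(vendor, root)))
        return good >= q - 1

    root = merkle_root([b"tx1", b"tx2", b"tx3"])
    sigs = {v: enclave_sign(v, root) for v in ("vendor_a", "vendor_b")}
    assert externally_verified(root, sigs, q=len(ENCLAVE_KEYS))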
System Model and Gossip Ticks
Now, to the protocol. The setting for the heart of the system is:
There are N nodes, where N is odd (say, 7).
The nodes are non-Byzantine: we assume each follows the protocol correctly and holds its invariants.
All communication is public and authenticated. Every message is signed by the node’s private key, and every node knows the set of participants.
Nodes exchange data using a pseudo-gossip protocol:
Each message = { tick_index, payload }.
Tick indexes are strictly monotonic per node. Normally, they increase by one per tick, but a node can speed up if it finds itself behind.
Nodes forward not only their own ticks but also others’ ticks if they detect someone is lagging.
This ensures that any message from any node eventually reaches all others, even if the topology is temporarily fragmented. As long as a majority of nodes can still talk to each other, directly or indirectly, consensus will be reached.
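A minimal Python sketch of the pseudo-gossip mechanics just described; class and node names are hypothetical, and real messages would also carry signatures:

    from dataclasses import dataclass, field

    @dataclass
    class TickMessage:
        node_id: str
        tick_index: int
        payload: dict = field(default_factory=dict)
        # A real deployment would also carry the sender's signature here.

    class GossipNode:
        def __init__(self, node_id: str) -> None:
            self.node_id = node_id
            self.seen: dict[str, int] = {node_id: 0}    # highest tick seen per node

        def tick(self) -> TickMessage:
            # The local tick index is strictly monotonic; a lagging node may jump ahead.
            self.seen[self.node_id] += 1
            return TickMessage(self.node_id, self.seen[self.node_id])

        def on_message(self, msg: TickMessage) -> list[TickMessage]:
            # Record the sender's progress; re-forward only if it is news to us, so any
            # message eventually reaches everyone even through a fragmented topology.
            if msg.tick_index <= self.seen.get(msg.node_id, 0):
                return []                               # stale or duplicate: drop
            self.seen[msg.node_id] = msg.tick_index
            return [msg]                                # forward to our peers

    a, b = GossipNode("a"), GossipNode("b")
    msg = a.tick()
    assert b.on_message(msg) == [msg]    # news: record and re-gossip
    assert b.on_message(msg) == []       # duplicate: drop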
For frequency: for truly low latency, ticks can go as high as 100 per second. Internally, we joke that 50Hz / 60Hz for European and American deployments might be it.
For worldwide consensus where nodes are geo-spread, ~125 milliseconds (1/8 second) sounds like a sensible default.
There’s no requirement for tick frequency to be above or below expected RTT. Practically, setting ticks too high only increases CPU usage. Rule of thumb: only a small fraction of CPU and network (say, under 10% of one core) should be dedicated to maintaining local and peer ticks.
Consensus Timekeeping
We introduce the consensus tick index — the protocol’s notion of global time.
The consensus tick can be the median of all tick indexes observed across the cluster (with N odd, the middle value of the N per-node tick indexes, i.e., the (N−1)/2-th order statistic when counting from zero).
For the system to advance to tick T, at least a majority of nodes must have observed T or higher. This guarantees liveness without strict synchrony: even if some nodes are slow, consensus time progresses with the majority.
Note: the consensus tick needn’t be the median to be monotonic. It could be the 2nd smallest or 2nd largest tick. The median is suggested to minimize end-user latency. Experiments may show that a “3rd largest tick” is an even better heuristic.
Tracking the consensus tick index is a CRDT: the index can only move forward, monotonically, as observed by any node individually or any communicating subset of nodes.
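Here is a small Python sketch of the consensus tick as a grow-only structure: each node tracks the highest tick it has seen per peer, merges views by element-wise maximum, and takes the median. The class name and numbers are illustrative, assuming N = 7:

    import statistics

    class ConsensusClock:
        # The consensus tick index as a grow-only CRDT: a per-node max-register map,
        # merged by element-wise maximum, with the median taken as "consensus time".
        def __init__(self, node_ids: list[str]) -> None:
            self.observed = {n: 0 for n in node_ids}    # highest tick seen per node
            self.consensus_tick = 0

        def merge(self, other_observed: dict[str, int]) -> int:
            for node, tick in other_observed.items():
                self.observed[node] = max(self.observed.get(node, 0), tick)
            # The median of N (odd) observed ticks: a majority has reached it, and it
            # can only move forward as more observations arrive.
            median = int(statistics.median(self.observed.values()))
            self.consensus_tick = max(self.consensus_tick, median)
            return self.consensus_tick

    clock = ConsensusClock(["n1", "n2", "n3", "n4", "n5", "n6", "n7"])
    clock.merge({"n1": 12, "n2": 11, "n3": 10, "n4": 10, "n5": 9, "n6": 3, "n7": 2})
    assert clock.consensus_tick == 10    # four of seven nodes have observed tick 10 or higher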
Leader Election by Ticks
Each tick’s payload carries extra information:
The tick index I,
A vote for the leader in the form { A, J }, where A is the proposed leader and J (with J > I + 1) is the upper tick (exclusive) through which this node votes for A.
Global rules:
J cannot exceed I + D, where D is the max allowed delta in ticks (default D = 10).
A node cannot issue contradictory votes. A witness presenting conflicting signed statements for the same tick can claim that node’s stake.
Pragmatically, on startup or restart, a node waits (D + 1) ticks (plus buffer) by its local clock to avoid incriminating itself.
This yields a sliding-window soft election:
A node A can act as leader for tick I if it can present proof that a majority of other nodes have granted leadership for I.
On voting policy: absent external intervention, each node keeps voting for the current leader as long as its latency to the quorum is a) within expectations and b) not definitively worse than another candidate’s.
For faster leader changes, the payload can also include up to three next-leader candidates so that concern about the current leader amplifies and converges more quickly.
In practice, tick payloads include more data (e.g., path probes to gauge pairwise latencies), but that’s out of scope here.
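To make the vote mechanics concrete, a hedged Python sketch of vote validation, conflicting-vote detection, and the majority proof follows; field names such as issued_at and until are my own shorthand for the { A, J } vote attached at tick I:

    N = 7
    D = 10                     # max allowed delta: J cannot exceed I + D
    MAJORITY = (N + 1) // 2    # 4 of 7

    def vote_is_valid(vote: dict) -> bool:
        # A vote {"issued_at": I, "leader": A, "until": J} grants A leadership for
        # ticks I < t < J, where J must satisfy I + 1 < J <= I + D.
        return vote["issued_at"] + 1 < vote["until"] <= vote["issued_at"] + D

    def votes_conflict(a: dict, b: dict) -> bool:
        # Two signed votes from the same node, overlapping in tick range but naming
        # different leaders, are contradictory; a witness holding both can slash.
        overlap = max(a["issued_at"], b["issued_at"]) + 1 < min(a["until"], b["until"])
        return overlap and a["leader"] != b["leader"]

    def has_leadership_proof(candidate: str, tick: int, latest_votes: dict[str, dict]) -> bool:
        # latest_votes: the most recent signed vote observed from each node. The proof
        # is a majority of nodes granting `candidate` a range that covers `tick`.
        grants = sum(1 for v in latest_votes.values()
                     if v["leader"] == candidate and v["issued_at"] < tick < v["until"])
        return grants >= MAJORITY

    votes = {f"n{k}": {"issued_at": 100, "leader": "n1", "until": 108} for k in range(1, 8)}
    assert all(vote_is_valid(v) for v in votes.values())
    assert has_leadership_proof("n1", tick=105, latest_votes=votes)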
Acting as the Leader
A node acts as leader for tick I if it can present the majority proof:
“At least (N+1)/2 nodes have allowed me to act as the leader for tick I.”
With this proof, the leader triggers the action associated with tick I.
We’ll cover what “acting” means (the system ultimately executes user code in exactly-once fashion). But there’s a subtle issue: what if the elected leader fails before acting?
This is real: the design above might result in work for tick I being skipped entirely.
Handling Failed Leaders: At-Most-Once
We avoid stalls by enforcing:
A leader can act on tick I only if the consensus tick index is less than I + S, where S is the allowed time skew.
If the system advances beyond I + S without observing the leader’s action, it assumes the leader didn’t act in time. The work for this tick won’t be done — which is fine; it will be executed in subsequent ticks.
This guarantees at-most-once execution:
At most one leader commits work for a given tick I.
If the leader fails, the tick is declared skipped, and the work is done later.
At this point, again, the system cannot trust the leader alone when it comes to respecting time-related guarantees. So, a two-phase commit comes to the rescue: for the leader to be allowed to perform the work, other nodes must confirm that it is indeed in time to do so.
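A tiny Python sketch of the at-most-once rule just described, with S and the tick values chosen purely for illustration:

    S = 5    # allowed time skew, in ticks (illustrative value)

    def leader_may_act(tick: int, consensus_tick: int) -> bool:
        # At-most-once: the leader may commit work for tick I only while the consensus
        # tick index is still below I + S. Past that point the tick is treated as
        # skipped, and its work rolls forward to a later tick.
        return consensus_tick < tick + S

    assert leader_may_act(tick=421, consensus_tick=423)         # still within the budget
    assert not leader_may_act(tick=421, consensus_tick=426)     # tick 421 is skipped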
Separating Quorum Proof from Execution
Critical decision: quorum nodes don’t execute work directly.
Why? In an asynchronous setting, a node could execute work but fail to communicate its result in time, causing inconsistent views.
Instead:
The leader broadcasts the intent to execute certain work for tick I.
A quorum of nodes confirms that the leader has authority for tick I, and that they received this intent within S ticks of I.
Later, the execution nodes (which may or may not be the same as quorum nodes) perform the work.
For each intent, a node outputs one of three results:
Good to go: signatures match; the intent arrived within S.
Too late: signatures match; but the consensus tick is ≥ I + S.
Alarm: signatures don’t match; cheating detected → slash and exclude.
Nodes also broadcast loss-of-confidence votes when they don’t see an intent within S ticks. These votes also accumulate in immutable, append-only CRDT fashion, propagating via the full N×N connectivity (directly or via hops).
For each tick I, a quorum is reached: either a quorum of good-to-go or a quorum of loss-of-confidence. No single node needs to “know” locally that quorum was reached — but an external observer subscribed to all nodes will see a monotonic consensus state for each tick I.
Finally, quorum thresholds needn’t both be simple majorities, such as four for yes and four for no when N=7. They only need to sum to N+1. Instead of 4+4, one could elect to go with asymmetric thresholds such as 3+5 or even 2+6 (that is, faster to execute, slower to give up). As long as the two thresholds sum to one greater than N, they’re valid. Yes, 2 confirmations may sound risky — but it’s acceptable when optimizing for happy-path latency.
All these parameters may be configurable in real time by governance on the Control Plane.
This decouples consensus on ordering from execution of business logic.
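Putting the per-node verdicts and the asymmetric thresholds together, here is an illustrative Python sketch; the verdict strings and the 3+5 split are example choices under the assumptions above, not fixed protocol constants:

    N = 7
    YES_QUORUM, NO_QUORUM = 3, 5    # asymmetric thresholds; they must sum to N + 1
    S = 5                           # allowed skew, in ticks

    def verdict(tick: int, consensus_tick_at_receipt: int, signature_ok: bool) -> str:
        # Each quorum node independently classifies the leader's intent for tick I.
        if not signature_ok:
            return "alarm"          # mismatching signature: slash and exclude
        if consensus_tick_at_receipt >= tick + S:
            return "too_late"       # counts toward loss of confidence
        return "good_to_go"

    def tick_outcome(verdicts: list[str]) -> str:
        # Verdicts accumulate append-only, CRDT-style; any observer that has seen enough
        # of them can conclude the outcome for tick I. In this simplification, "too_late",
        # "alarm", and an explicit loss-of-confidence vote all count toward the no-quorum.
        if verdicts.count("good_to_go") >= YES_QUORUM:
            return "committed"
        if sum(v != "good_to_go" for v in verdicts) >= NO_QUORUM:
            return "skipped"
        return "undecided"

    assert YES_QUORUM + NO_QUORUM == N + 1
    assert tick_outcome(["good_to_go"] * 3) == "committed"
    assert tick_outcome(["too_late"] * 5) == "skipped"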
A Blockchain-Like Skiplist: At-Least-Once
Since some ticks may be skipped, we need a mechanism to carry work forward.
This is straightforward because we’re solving consensus via total order broadcast. Once sequence numbers are stamped, execution proceeds deterministically and correctly.
By analogy: we’re linearizing a moderately large set of independent streams. Each stream is ordered, but there’s no intrinsic order across streams; we must create one deterministically.
If you’re from traditional SysDesign: we’re turning a multi-partition Kafka topic into a single-partition one. From Web3: we’re up-rolling L2 work to L1.
We also need to handle the case where the leader didn’t do their job for tick I yet still construct a total order.
The solution: every intent-to-execute message includes the currently seen HEAD offsets in each partially ordered (L2) stream. That’s it.
Example: four topics T1..T4. At tick I=420, HEADs are {100, 200, 300, 400} and that tuple is confirmed and journaled. At I=421, offsets are {105, 200, 300, 410} — fifteen new messages total (5 in T1, 10 in T4). The intent for I=421 is {105, 200, 300, 410}.
Afterward, any deterministic approach can splice the fifteen messages into a single order (by relative timestamp, rate assumptions, or — for a POC — just scanning topics sequentially).
The number of input “logical topics” is immaterial: the intent can be sparse. Since every node knows the last executed offset per topic, it can include only the ones that moved: e.g., { T1: 105, T4: 410 }.
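Continuing the T1..T4 example, here is a small Python sketch of splicing the sparse HEAD intent into a single deterministic order (the names and the topic-name tie-break are illustrative):

    # HEAD offsets journaled at the previously committed tick (I = 420)...
    prev_heads = {"T1": 100, "T2": 200, "T3": 300, "T4": 400}
    # ...and the sparse intent for I = 421: only the topics that moved.
    intent_heads = {"T1": 105, "T4": 410}

    def splice(prev: dict[str, int], intent: dict[str, int]) -> list[tuple[str, int]]:
        # Deterministically linearize every message the new HEADs cover. Scanning
        # topics in name order is a POC-grade tie-break; any deterministic rule
        # (relative timestamps, rate assumptions) would do.
        ordered = []
        for topic in sorted(intent):
            for offset in range(prev.get(topic, 0) + 1, intent[topic] + 1):
                ordered.append((topic, offset))
        return ordered

    total_order = splice(prev_heads, intent_heads)
    assert len(total_order) == 15             # 5 new messages in T1, 10 in T4
    assert total_order[0] == ("T1", 101)
    assert total_order[-1] == ("T4", 410)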
Most importantly: if work at I=421 isn’t submitted and confirmed in time, the next executed tick will linearize everything just fine. The total order continues. Most users won’t notice.
Only ticks where the quorum confirms the leader’s intent become committed actions.
Other ticks remain empty, but the chain of committed ticks stays totally ordered.
This yields a chain of blocks with strong properties:
Finality: Once a tick is confirmed, it cannot be undone without slashing or external proofs of misbehavior.
Rolling hashing: Once the total order is stamped, it is hashed and broadcast so clients can be sure the system behaves correctly.
Fast recovery: If a leader dies, a new one takes over quickly — ~0.2s within a datacenter, or ~2s across continents.
Self-healing consensus: Nodes continuously monitor quorum health and adapt automatically.
Latency-Aware Leader Selection
We integrate latency awareness:
Every node periodically runs “ping experiments,” sending messages along random chains of nodes and measuring RTTs.
Results are aggregated over ~30s intervals.
Nodes select leaders based on the lowest “quorum-achieving time.”
This also results in a nice observability dashboard for admins to quickly see whether any nodes are showing higher latency than expected.
For robustness:
Each latency measurement is modeled as a distribution (e.g., Gaussian).
We optimize for the three-sigma worst case, not just the mean.
Leaders can even renounce leadership if they detect another node would achieve quorum faster, announcing in advance that followers should vote for the next leader starting at tick T.
This naturally biases the system toward the healthiest, lowest-latency nodes — without explicit coordination.
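A minimal Python sketch of the three-sigma scoring described above, with made-up measurements; a production version would aggregate the ~30-second ping experiments per candidate:

    import statistics

    def quorum_time_3sigma(samples_ms: list[float]) -> float:
        # Model a candidate's recent quorum-achieving times as roughly Gaussian and
        # score it by the three-sigma worst case, not just the mean.
        return statistics.mean(samples_ms) + 3 * statistics.pstdev(samples_ms)

    def pick_leader(candidates: dict[str, list[float]]) -> str:
        # candidates: node id -> quorum-achieving-time measurements (ms) from the
        # last ~30 seconds of ping experiments.
        return min(candidates, key=lambda node: quorum_time_3sigma(candidates[node]))

    measurements = {
        "n1": [4.0, 4.2, 4.1, 9.5],    # lower mean, but a fat tail
        "n2": [5.0, 5.1, 4.9, 5.2],    # slightly slower, far more predictable
    }
    assert pick_leader(measurements) == "n2"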
Why This Is Exciting
This protocol blends the strengths of multiple paradigms:
From Paxos/Raft → deterministic quorum safety and log replication.
From blockchains → leader rotation, Merkle hashing, and finality-driven execution.
From secure enclaves → cryptographic verifiability to strengthen trust.
From Kafka-style streaming → massive scalability by separating ordering from execution.
We end up with a protocol that is:
Low latency for consensus.
Self-healing in the presence of node failures.
Throughput-scalable by decoupling execution from ordering.
Audit-friendly due to signed tick proofs and quorum attestations.
Where This Could Go Next
Effectively, the above design enables:
Trusted execution environments (TEEs) verifying consensus steps inside enclaves.
Staking & slashing layered on top, to disincentivize malicious behavior.
Optimistic batching for 100M+ transactions per second while maintaining strict safety.
For the first POC, the orderbook example — trading instruments one for another — is the natural next step.
Closing Thoughts
This isn’t just about replacing Raft or HotStuff.
It’s about stacking layers of consensus:
A low-latency quorum for ordering.
A high-throughput execution fabric on top.
Verifiable finality via secure enclaves.
If done right, this could power globally distributed systems capable of millions of durable, auditable decisions per second.