Beyond the Schema Registry
We had a good ride. What's next?
The more I work with gRPC, the more convinced I become that the protobuf-based ecosystem itself is broken.
Don't get me wrong. Strong schemas and their evolution are amazing things, and products such as https://buf.build/ are very important for the industry. Even more so for the schemas of stored and replayed events than for RPC use cases themselves, by the way.
But something most definitely is off.
It is safe to say that we, the software engineering industry, have for several years now been at the point where it should not matter how exactly the code is broken down into pieces before it is run.
Let me explain.
Say you have a piece of software that involves two components talking to each other at runtime (OLTP), plus an analytical processing piece (OLAP). For a simple example, consider:
making transactions in a small online store,
searching for an item,
adding it to the shopping cart,
making the payment,
preparing the invoice,
and then later seeing this item move up in the list of best sellers.
In theory, all of these are trivial operations.
As an engineer on a team that maintains this code, I should be able to write an end-to-end test, in virtually every programming language, that would be able to confirm that the above succeeds. I should then be able to run this end-to-end test on my laptop, locally, with no Internet connectivity whatsoever. And the test would run in under a minute or so; likely in just a few seconds if all the code is pre-built, and I have only made a trivial change to one of the components involved.
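A test like that can be tiny. Here is a minimal sketch of the flow in TypeScript, with both components faked in-process: an OLTP "store" and an OLAP "best sellers" aggregation fed by purchase events. All names here are invented for illustration; a real setup would talk to real containers.

```typescript
// In-memory fakes for the two components under end-to-end test.
type Item = { id: string; name: string };

class Store {
  private catalog: Item[] = [
    { id: "1", name: "espresso machine" },
    { id: "2", name: "grinder" },
  ];
  private cart: Item[] = [];
  readonly purchases: Item[] = []; // event log consumed by the OLAP piece

  search(q: string): Item[] {
    return this.catalog.filter((i) => i.name.includes(q));
  }
  addToCart(item: Item): void {
    this.cart.push(item);
  }
  pay(): string {
    this.purchases.push(...this.cart); // emit purchase events
    const invoice = this.cart.map((i) => i.name).join(", ");
    this.cart = [];
    return `invoice: ${invoice}`;
  }
}

// The "OLAP" piece: fold the event log into a best-sellers ranking.
function bestSellers(purchases: Item[]): string[] {
  const counts = new Map<string, number>();
  for (const p of purchases) counts.set(p.name, (counts.get(p.name) ?? 0) + 1);
  return [...counts.entries()].sort((a, b) => b[1] - a[1]).map(([n]) => n);
}

// The end-to-end flow: search → add to cart → pay → see the ranking move.
const store = new Store();
const [found] = store.search("espresso");
store.addToCart(found);
console.log(store.pay()); // → invoice: espresso machine
console.log(bestSellers(store.purchases)[0]); // → espresso machine
```

The point is not the fakes; it is that the whole round trip, OLTP to OLAP, runs in one process in milliseconds, with no network in sight.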
Realistically, the above is all true. Assuming your OLAP DWH is not cloud-based (or at least assuming that you have a local version of it for testing purposes), most decent software development teams have built and integrated a similar setup.
Different pieces of what is end-to-end tested would likely be implemented in different programming languages, and their interoperability may involve a docker compose topology, or even some Airflow pipeline. We could argue whether that's a good thing or a bad thing, but let's put it aside for now. This end-to-end test will run, and it will pass. And it will likely run and pass on your laptop too, if you want to test this thing on a plane from San Francisco to Sydney, with no connectivity to the rest of the world.
And such an end-to-end test will likely run via some GitHub Action on every pull request the team makes. All good stuff. As an architect, I fully endorse this.
At the same time, the boundaries between components will likely be somewhat fixed and hard-coded. If it's, say, five Docker containers, they will all work, and making a small change to any of them will be trivial. But splitting a container into two, or merging two containers into one — even if their logic is implemented in the same programming language! — would be a far more nontrivial piece of work. Strange, huh?
At this point, I'd bet most of my readers would simply nod their heads, perhaps even be confused by the question. Of course the above is true. Of course merging two containers into one is far more difficult than making a local change to one of them. How could it possibly be otherwise? That's why we have API contracts and end-to-end testing, after all: precisely because cross-component changes are difficult and error-prone. That's what Swagger and proto registries and service meshes and API gateways are for: to make these difficult changes less painful.
I hear you.
But didn't we learn how to refactor code effectively, a decade or so ago?
Nobody in their right mind would argue along the lines of "changing the logic of one function is easy, but moving some logic into a different function is a far more difficult task". This is just not true. It has not been difficult for a long time now. We do it multiple times a day as part of our work routine, often subconsciously, while focusing our minds on a far larger task. Refactoring code and moving parts of it between functions, or between source files, directories, or even repositories, is not some fundamental skill to master. It's just a technique, much like typing JSON.stringify(). We have IDE support to make these changes, and we have nice diff tools to make reviewing them easy. In fact, these changes are considered trivial today, and for a good reason: they are trivial indeed.
Imagine a world where refactoring your code requires you to first get architectural and code approval for the signatures of the functions, "for the sake of correctness" and "because this is the process we follow here", and only then, after these altered signatures have made it into the code, are you allowed to change their implementations. What is now a fifteen-minute piece of work can easily become a week. I hope we can all agree such a process would be a needless waste of time.
Now imagine how it feels to walk into most "we need a new API contract" conversations with this very sense that something is remarkably inefficient and quite far off. Well, that's how I feel about API contracts in .proto files these days.
Another major trend is asynchronous, event-driven programming. Back in the old days, unless you were using Erlang from the start, moving the execution of some code "away from the main thread", so that it could run "concurrently" with "other things", was a far from trivial task. These days, with Node.js, then Go, and now Kotlin coming in, such a change has also been successfully "downgraded" from advanced sorcery to merely a programming technique. Unironically, this applies to other languages too, including C++ and Rust, and even C#; in fact, one could say that parallel for was something C# pioneered. We, the industry, just know that turning sequential execution into a parallel, event-driven process must be trivial. And we have plenty of means to make sure it indeed is trivial enough, if not in all then at least in most real-life cases.
Okay, this is an overstatement, and some tasks of making things parallel are the opposite of trivial. But it’s hard to argue against the fact that in 2023 we do have the techniques to make things massively parallel, and these techniques are trivial enough that they are taught at bootcamps ffs!
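To make the "merely a technique" claim concrete: turning two sequential awaits into concurrent execution is a one-line change in modern TypeScript. The fetchA/fetchB stand-ins below represent any two independent pieces of async work.

```typescript
// Two independent async tasks (stand-ins for real I/O-bound work).
const fetchA = async (): Promise<number> => 1;
const fetchB = async (): Promise<number> => 2;

// The sequential version: B waits for A even though it doesn't have to.
async function sequential(): Promise<number> {
  const a = await fetchA();
  const b = await fetchB();
  return a + b;
}

// The concurrent version: same result, both tasks in flight at once.
async function concurrent(): Promise<number> {
  const [a, b] = await Promise.all([fetchA(), fetchB()]);
  return a + b;
}

concurrent().then((sum) => console.log(sum)); // → 3
```

The diff between the two functions is exactly the kind of routine, reviewable change the paragraph above is describing.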
However, we seem to have collectively conceded to the idea that a "change" that affects an "API contract" between two components *has* to be difficult. In the modern world of microservices such a change may even involve a major API version increment, as well as maintaining multiple versions of the API for some time, while other "clients" are "upgrading" to a "new" version. And we take such an argument seriously, with a straight face.
Imagine someone telling you the above process is essential when it comes to taking logic X from functions A() and B() and refactoring this X into a new function, AB(). Everybody's reaction would — correctly! — be along the lines of: Come on, that's a diff of under fifty lines total, maybe a hundred if you add new tests; just prepare a pull request and send it in, for lambda's sake!
What I'm saying is not that protocol buffers and schema registries are bad. They are incredibly good, if not outright invaluable, when a large team of engineers needs to maintain a large, diverse code base.
Many years ago, when I was developing a high-performance C++ service, I made it an explicit point that our backend could expose its own [HTTP] API endpoint schema, along with all the associated data types, in both TypeScript and F# formats. Because our clients were using TypeScript and F#.
During the active development phase, I would just say: Hey, just pushed a new version of the API to foo.api.company.com, could you pls check if this "foo dot" is good enough? If the answer was positive, this "foo dot" API would make it to prod by the end of the business day, with zero downtime for the end users. The engineers loved this. The engineers of the company that acquired our tech loved it even more; and we had years-long uptimes of this C++ code with evolving data and API schemas, but that's not the point now.
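The mechanics of such a setup are unglamorous: keep one internal description of your types, and render it into whatever your clients speak. Here is a toy sketch of the TypeScript-emitting side of such a generator; the schema shape and all names are invented for illustration (the real service did this from C++).

```typescript
// One internal source of truth for a data type...
type FieldType = "string" | "number" | "boolean";
interface TypeSchema {
  name: string;
  fields: Record<string, FieldType>;
}

// ...rendered into a TypeScript declaration for the clients.
function toTypeScript(schema: TypeSchema): string {
  const body = Object.entries(schema.fields)
    .map(([field, t]) => `  ${field}: ${t};`)
    .join("\n");
  return `interface ${schema.name} {\n${body}\n}`;
}

const user: TypeSchema = {
  name: "User",
  fields: { id: "string", isDeactivated: "boolean" },
};
console.log(toTypeScript(user));
// interface User {
//   id: string;
//   isDeactivated: boolean;
// }
```

An F# emitter is the same fold over the same schema with different string templates; the schema itself never leaves the backend's control.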
Was I proud of this? Yes. Would I do the same thing today? No. I would do better.
Several years ago, sending a Swagger link, or a pointer to a .proto file, was a sign of respect. Today I increasingly view it as a sign of laziness. "This is what we have done for you from our side; now do your thing" — that's what it sounds like to me. Professional, yes, but cold-hearted, even a tad arrogant. This works, but we can do better; and we should do better. Or at least so I want to believe.
While we’re on this page: end-to-end tests are unit tests too. We live in the world of Docker containers and GitHub Actions; show some respect to your colleagues and don’t make them learn all about your development practices. Make your code work on their machine in one click, even if you know they are boarding a fifteen-hour flight right after git-pulling your dev branch. And make your code start fast.
In an extreme case of a Web application, for example, consider a build target that ships the entire backend as a single WebAssembly component, so that the “service under development” is blazing fast for your partner in crime who’s working on the user-facing side of it.
This, plus enough tests, is a superior approach. Once it is adopted, schema registries will no longer be needed going forward. Team A could still use .proto files behind the scenes, team B could auto-generate them from macro-based signatures in C++ or Rust, and team C could say “screw it”, develop their service in Erlang, and make sure they have a build step that converts the internal contracts, decorated in a certain way, into external ones. As long as each team consistently delivers client libraries for the major languages the company uses, and as long as these client libraries have a few tests for every method, the company has all it needs to move forward at full speed.
What I'm coming full circle to say is that quite some time ago I began seeing protocol buffer schemas as a liability, even as a hack, rather than as some universal force for good. And this belief is only getting stronger as time goes by.
With all my love for strong typing and compile-time invariant checking, wouldn't we live in a better world if a team exposed not "a microservice available on a mesh" but "a component with the following set of client-side APIs in the programming languages you guys use"? Then, if the API "contract" changed, the first thing this team would do, probably even before talking to other teams, is make the change and see what compilation errors it yields.
Say there was some isDeactivated field in the user data type. We know for a fact that removing it would be a breaking change. But what exactly does it break? An admin API endpoint on a back-office page that was not opened once last quarter, and a data-aggregation pipeline that looks for anomalies and also covers this "useful" field? Duh.
If the build system highlights these particular two use cases within a few minutes of making the code change that removes (or simply renames!) this isDeactivated field, the very team that "owns" this field may well have the job of eliminating it done between their morning standup and their lunch break.
API contracts and schema registries are great. But API contracts are best viewed as by-products of good engineering practices, not as a source of those practices. And, for a long time now, we have been able to live happily in a world where we do not teach developers to start by writing .proto files and gRPC function definitions. The team can just maintain their inner code plus a few world-facing client libraries. The protocol definition itself may be hidden somewhere in the inner code, explicitly, if you wish. Or it can emerge there implicitly, because the development toolchain allows for proper type inference. Or it can be non-existent altogether, because at some point the client library stops making RPC calls, and instead subscribes to some pub-sub bus of updates, maintains a local cache, and evaluates results on the fly, as opposed to “calling” the "service".
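For instance, that last pattern — a client library that stops "calling" and starts listening — might look like the sketch below. The bus and the cache are in-memory stand-ins, and every name here is invented for illustration.

```typescript
// A client library that subscribes to an update bus and answers queries
// from a local cache, instead of making an RPC per call.
type User = { id: string; name: string };
type UserEvent =
  | { kind: "upsert"; user: User }
  | { kind: "delete"; id: string };

class Bus {
  private subscribers: Array<(e: UserEvent) => void> = [];
  subscribe(fn: (e: UserEvent) => void): void {
    this.subscribers.push(fn);
  }
  publish(e: UserEvent): void {
    for (const fn of this.subscribers) fn(e);
  }
}

class UserClient {
  private cache = new Map<string, User>();
  constructor(bus: Bus) {
    // Keep the local cache current by folding the update stream into it.
    bus.subscribe((e) => {
      if (e.kind === "upsert") this.cache.set(e.user.id, e.user);
      else this.cache.delete(e.id);
    });
  }
  // What used to be an RPC is now a local lookup.
  getUser(id: string): User | undefined {
    return this.cache.get(id);
  }
}

const bus = new Bus();
const client = new UserClient(bus);
bus.publish({ kind: "upsert", user: { id: "42", name: "Ada" } });
console.log(client.getUser("42")?.name); // → Ada
```

Note that the "contract" here is just the User type and the event union, living in ordinary code that the compiler checks on every build.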
What I'm saying is not "stop thinking in API contracts". API contracts are a great way to design systems. But I do urge people to quit treating them as a sacred cow. Because idolizing the means, as opposed to solving the end-to-end problem, is just one of many ways in which a bad developer can create ten new jobs in a year.
A good API contract is just a good way to call a function, much like sort() or JSON.parse(). Let's focus on getting our jobs done instead.