Here comes my unprovoked attempt at a C++20 coroutines tutorial.
Intro
Let me share a secret to begin with. When I first looked into C++ coroutines, I was terrified. There’s a ton of extra syntactic bloat. There are far too many ways to make a mistake.
Errors, both static and runtime, are not easy to make sense of. With static checks, after a while, I began to understand the compiler’s logic well enough to see why exactly a given piece of code is wrong and should not build. At runtime, it was also not trivial for me to grasp when exactly the promise and the return type object are constructed, from what parameters, and how many times.
In short, for the first several hours, C++ coroutines are about what one would expect from C++ in the worst way possible: a lot of nontrivial things to wrap your head around, no clear picture forming right away, and compiler diagnostic messages that do not seem helpful at first.
But then it hit me.
C++ is always about not paying for what you do not use.
So the Committee will always err on the side of making sure the resulting binary code is not doing extra work unless truly needed. And from this point, it is easy to explain all those constexpr return types such as `suspend_always` and `suspend_never`.
The next step for me was to realize where the true genius of C++ coroutines is. It is in separating the definitions of the coroutines from their runtime.
See, to embed coroutines into the language, there are two whales for it to stand on:
Language support for stackless functions, and
The execution model for these functions.
Language support, (1), is simply about being able to write `await` (or, in C++, `co_await`), and have the code yield execution to some other logic. Yes, eventually it is about fibers and green threads, and yes, this is about making sure the code lives in the user space for as long as possible, since kernel calls are expensive in massively parallel settings. But on the language support level, these are not part of the picture yet.
Another way to put it: language support, (1), is about introducing the way for the code to read linearly, without the callback hell.
The execution model, (2), is the mechanism that makes sure the code implemented within the body of a coroutine does get executed. Eventually, each “elementary piece of coroutine code”, “from one suspend to another” if you wish, would need to be run by some OS/kernel thread.
Here’s what the C++ Committee effectively says here. Or at least that’s the reading of the Standard that brought peace and enlightenment upon me.
C++ should be a language for everyone. Everyone includes ultra-low-power and ultra-low-bandwidth CPUs, also known as MCUs. Take an 8-bit microcontroller with literally under one megahertz clock rate. If you can use C on it — and you can! — then you should be able to use C++ on it. Including using the C++ coroutines mechanism.
I cannot emphasize enough how genius I now believe this is. Coroutines in C++20 are:
100% language constructs support + 0% execution layer.
Seen in this light, the challenge is to introduce stackless (and threadless!) functions so that the core language remains intact. And this is exactly what the `std::coroutine_handle<>` magic does.
This also means that in the example code I wrote I had to implement the “trivial” execution layer. Which I did. Granted, in an overly-safe mutex-locked way, but I believe for [self-]educational purposes it is just fine.
Also, at some point, mostly for myself, I decided to replace the “true” calls to `sleep()` with the notion of “fake time” that the executor keeps track of. And I was pleasantly surprised by how simple this turned out to be.
Last but not least: my long-held dream was to implement the wrapper that would allow synchronous and asynchronous functions to operate interchangeably. In simple terms: if there is a type `Async<int>` for an `int` result value that can be `await`-ed, I wanted to make sure an instance of this type can be created from a plain `int` value, with no worries on the developer’s side about whether a certain piece of `await`-driven logic is guaranteed to be a pass-through no-op, as the value is immediately available. The user may want to check this. This user may even want to check this at compile time, and perhaps add a `static_assert`, or even some SFINAE magic. But the decision to use or to bypass this logic should be on the writer of the code. And this is exactly what I managed to enable.
Nuff talk. Let’s get to the code.
Code
All the code I am referring to can be found at https://github.com/dkorolev/cofizzbuzz.
During my experiments it became clear quite early on that if I structure the commits just right this easy-to-follow text will be a useful byproduct. Here it is.
Most of the code snippets below loosely follow the FizzBuzz example. It’s a toy problem, and it will take under one minute to get familiar with, on the off chance you have not heard of it before.
Part I
Also, the solution to this problem in C fits in a tweet. The old, 2015, tweet. The 140-character tweet. Including the #Ha hashtag at the end.
I took the liberty of modifying the problem so that for 15 it should print “Fizz” followed by “Buzz”, not just “FizzBuzz”. As two distinct output lines, that is, as two distinct “callback” invocations. This is to make the early-stopping illustration even more explicit.
Step 1
Initial FizzBuzz. It does not stop at 15 and prints 16 lines, since one `.Next()` call can yield more than one callback call.
The code for this example builds on the FizzBuzz exercise. With a tiny exception that numbers that are divisible by both three and five result in not one "FizzBuzz", but a "Fizz" followed by a "Buzz".
This step only checks the "stop at fifteen" condition at the very top level. So the code prints sixteen steps, not fifteen. It overruns the boundary by one, because `.Next()` called with `value == 15` results in both “Fizz” and “Buzz” being emitted.
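The per-value behavior can be sketched as follows (function and helper names here are my own illustration, not the repo's): note how value 15 triggers the callback twice, which is exactly why "loop up to 15" yields sixteen output lines.

```cpp
#include <functional>
#include <string>
#include <vector>

// Sketch of the modified FizzBuzz: for a value divisible by both 3 and
// 5, the callback fires twice -- "Fizz" then "Buzz" -- as two separate
// invocations. Names are mine, not necessarily the repo's.
inline void FizzBuzz(int value, const std::function<void(std::string)>& cb) {
  bool divisible = false;
  if (value % 3 == 0) { cb("Fizz"); divisible = true; }
  if (value % 5 == 0) { cb("Buzz"); divisible = true; }
  if (!divisible) { cb(std::to_string(value)); }
}

// Stopping only at the top level: looping values 1..15 overruns the
// "fifteen outputs" boundary by one, since 15 emits two lines.
inline std::vector<std::string> TopLevelLoop(int up_to_value) {
  std::vector<std::string> result;
  for (int i = 1; i <= up_to_value; ++i) {
    FizzBuzz(i, [&](std::string s) { result.push_back(std::move(s)); });
  }
  return result;
}
```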
Step 2
Introduced the "cache for the yet unseen value". It stops at 15 steps now.
This is an exercise in modifying the code so that the user-provided callback is invoked exactly once per output term.
Step 3
Introduced (synchronous) `IsDivisibleBy{Three,Five}`.
This is a trivial yet dedicated commit to branch out the "divisible by {three,five}" checks into their own functions.
These functions will become coroutines soon, but not before we make them slow on purpose, running from the same `main()` thread, then from their own threads, then via the single executor thread, and then ultimately from this single executor thread using C++20 coroutines.
Step 4
Now "async"-style `IsDivisibleBy{Three,Five}`, a.k.a. the "callback hell".
To illustrate the pains of callback-based programming.
Also, introduced artificial `sleep()`-s. They will soon help show how sequential waits become parallel waits.
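The "callback hell" shape of this step can be sketched like this (identifiers are my illustration, not the repo's exact code): each check "responds" via a callback after an artificial sleep, and composing two of them already forces one level of nesting per step.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hedged sketch of callback-style async checks: each one "responds"
// via a callback after an artificial sleep. Names are mine.
inline void IsDivisibleByThree(int value, std::function<void(bool)> cb) {
  std::this_thread::sleep_for(std::chrono::milliseconds(10));
  cb(value % 3 == 0);
}

inline void IsDivisibleByFive(int value, std::function<void(bool)> cb) {
  std::this_thread::sleep_for(std::chrono::milliseconds(10));
  cb(value % 5 == 0);
}

// The "hell" part: nesting grows with every extra async step, and the
// two 10ms waits here run strictly one after the other.
inline void Classify(int value, std::function<void(bool, bool)> done) {
  IsDivisibleByThree(value, [value, done](bool by3) {
    IsDivisibleByFive(value, [by3, done](bool by5) {
      done(by3, by5);
    });
  });
}
```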
Step 5
Measuring the time each step takes, confirming it's 20ms == 10ms + 10ms.
Added time measurements and confirmed two "simulated slow calls" are indeed not parallelized.
Step 6
A [one-off] demo of using futures+promises. Less of a callback hell, but `sleep` is still sync.
Demonstrating how futures and promises could be used. If only to get rid of the callback hell.
Spoiler alert: it will be even better. But it is important to note that in C++20 the implementation of the coroutines has nothing to do with futures and promises, even though the inner type for a coroutine frame/lifetime type must have the `promise_type` sub-type.
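A minimal version of the futures-and-promises idea might look like this (my own sketch, not the repo's code): the callback is gone, but `.get()` still blocks the caller for the full duration of the sleep. Note that this `std::promise` is entirely unrelated to the coroutine `promise_type` mentioned above.

```cpp
#include <chrono>
#include <future>
#include <thread>

// Sketch: an async divisibility check via std::promise / std::future.
// No callback nesting, but .get() still blocks synchronously.
inline std::future<bool> IsDivisibleByThreeAsync(int value) {
  std::promise<bool> p;
  std::future<bool> f = p.get_future();
  std::thread([value, p = std::move(p)]() mutable {
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    p.set_value(value % 3 == 0);  // fulfills the future
  }).detach();
  return f;
}
```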
Step 7
Use detached threads to have `10ms + 10ms = 10ms` "total" wait.
Sleep for "10ms" and "10ms" in parallel, so that 10ms + 10ms is still 10ms. Use detached threads for now; this is still an illustrative example.
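The parallel-waits idea can be sketched as follows (my illustration; the repo uses detached threads here, while this sketch uses `std::async` for brevity): both futures are started before either is awaited, so the two 10ms sleeps overlap in time.

```cpp
#include <chrono>
#include <future>
#include <thread>
#include <utility>

// Sketch: two slow checks started before either is awaited, so their
// 10ms sleeps overlap and the total wait is ~10ms, not 20ms.
inline bool SlowIsDivisibleBy(int value, int divisor) {
  std::this_thread::sleep_for(std::chrono::milliseconds(10));
  return value % divisor == 0;
}

inline std::pair<bool, bool> CheckBothInParallel(int value) {
  auto f3 = std::async(std::launch::async, SlowIsDivisibleBy, value, 3);
  auto f5 = std::async(std::launch::async, SlowIsDivisibleBy, value, 5);
  return {f3.get(), f5.get()};  // both sleeps are already in flight
}
```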
Step 8
Logging thread names, to illustrate the "work" is done by a large number of "dedicated" threads.
Proving that there are indeed plenty of freshly spawned threads. And laying the groundwork to later prove these threads converge into one executing thread.
Step 9
Explicitly verbose, ugly, event-driven logic. Work still initiated and executed by dedicated threads.
Further, deliberately ugly, illustration of how the state machine of the coroutine will work behind the scenes.
This code is ugly by design; it is only here to demonstrate what will eventually happen behind the scenes, automagically, once the coroutines are used properly.
Step 10
The dedicated, single, executor thread in which the work is done.
This commit is starting to put the pieces together.
First, instead of spawning and detaching multiple threads, "one per sleep", introduce a single, "executor" thread.
This thread acts as the scheduler, and it executes all the `sleep()`-s — i.e. the code that is scheduled to run after a delay. This commit also makes the final graceful shutdown logic cleaner.
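The skeleton of such a single-thread executor might look like this (a minimal sketch in the spirit of this step, not the repo's exact code): all scheduled work runs on one worker thread, and the destructor drains the queue before joining, which is the "graceful shutdown" part.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// A minimal single-thread executor sketch: one worker thread runs all
// scheduled tasks; destruction drains the queue, then joins.
class Executor {
 public:
  Executor() : worker_([this] { Run(); }) {}
  ~Executor() {
    {
      std::unique_lock<std::mutex> lock(mutex_);
      done_ = true;
    }
    cv_.notify_one();
    worker_.join();  // graceful shutdown: finish what was scheduled
  }
  void Schedule(std::function<void()> f) {
    {
      std::unique_lock<std::mutex> lock(mutex_);
      queue_.push(std::move(f));
    }
    cv_.notify_one();
  }

 private:
  void Run() {
    while (true) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return done_ || !queue_.empty(); });
        if (queue_.empty()) return;  // done_ is set and queue is drained
        task = std::move(queue_.front());
        queue_.pop();
      }
      task();  // runs on the single worker thread
    }
  }
  std::mutex mutex_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> queue_;
  bool done_ = false;
  std::thread worker_;  // declared last: started after the state above
};
```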
Step 11
A thread-local placeholder for the executor, to save on manual "wait 'til done".
Making the executor thread not a global one but a scoped one, accessed via a thread-local singleton. It is now the user's responsibility to only run the "future coroutines" in a scope that is inner to the scope where the thread-local-singleton executor instance lives.
Part II
In this part, after all the prep work, we’re getting to C++20 coroutines for real!
Step 12
Added `#include <coroutine>`! Executing via `resume()`. On an example for now.
Finally, about to work with C++20 coroutines.
For the next few commits, the FizzBuzz example is on hold, the code is "just" printing integers in order, and checking if they are odd or even. The "coroutine handler" is merely the way to invoke `.resume()` once.
So, this code does introduce a stackless function for real. But it's still the first, and largely trivial, example.
Step 13
Introduced `struct Coroutine` to contain the coroutine-related code.
Minor, having this trivial coroutine-related code work with the thread-local-singleton executor introduced before.
Step 14
Leveraging the thread-local-singleton executor in the coroutine-handling code.
This is done via a virtual `ResumeFromExecutorWorkerThread()` call, which ultimately invokes `.resume()` on a `coroutine_handle<>`, and this call is made from the executor thread.
Now the code of the resumable stackless function truly is:
inline CoroutineBool IsEven(int x) {
  // Confirm multiple suspend/resume steps work just fine.
  if ((x % 2) == 0) {
    co_await Sleep(1ms);
    co_await Sleep(1ms);
    co_return true;
  } else {
    co_await Sleep(1ms);
    co_await Sleep(1ms);
    co_await Sleep(1ms);
    co_return false;
  }
  // Just `co_return ((x % 2) == 0);` would work too!
}
Step 15
Made the coroutine return type templated.
Here, I am starting to converge on the `Async<T>` type, which will emerge in a few commits.
For starters, the hardcoded `CoroutineReturningBool` becomes `CoroutineReturning<bool>`. Also there is the `CoroutineReturning<int>` type, to print the squares of integers, in addition to whether they are odd.
Step 16
The return value of the coroutine is now stored in the base `<T>` class.
Further refactoring, branched out the base class to store the return value, so that it can be specialized for `<void>`. The "usual" C++ magic.
Also, it's just `Coro<T>` after this commit. Will become `Async<T>` once we're fully done.
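The "usual magic" can be sketched as follows (names are mine; the repo's differ): a templated base class holds the return value, and a full specialization for `void` simply holds nothing, so the derived coroutine type works uniformly for both cases.

```cpp
#include <string>
#include <utility>

// Sketch: a base class storing the coroutine's return value, with a
// <void> specialization that stores nothing. Names are illustrative.
template <typename T>
struct ReturnValueHolder {
  T value{};
  void SetValue(T v) { value = std::move(v); }
  const T& GetValue() const { return value; }
};

template <>
struct ReturnValueHolder<void> {
  // No stored value: `co_return;` with no argument maps here.
  void SetValue() {}
  void GetValue() const {}
};
```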
Part III
Putting this all together, FizzBuzz + C++20 coroutines.
Step 17
Using the `Async<>` type for real.
Back to running FizzBuzz!
The "old" example code is still there, runnable with --example
.
I will use it later, at the very end, to illustrate the Ctrl+\
SIGQUIT-based reporting of what's currently running.
Step 18
Executor counters and debug logging. And `Coro<T>` is finally `Async<T>`.
Added basic telemetry, and finally introduced `Async<T>`.
Also, I ran the test that if one call to `co_await Sleep(10ms)` is replaced by two `co_await Sleep(5ms)` calls, then the number of "executed steps" grows — as expected.
Step 19
Made immediate values first-class citizens.
This is the commit that I wanted to get to. Making it happen got this whole thing started.
In this commit, a value that is `Async<T>` can be returned from a coroutine or from a regular function. In simple terms, this commit extends the behavior of `Async<T>` so that it can "pretend" to be a coroutine-aware type while in fact holding an already-available value, so that no executor needs to get involved for the user to extract this value — via the "standard" `co_await` technique.
Important note: this commit breaks `g++`, at least in `-O3` mode. Two versions of `g++`, in fact, on my Ubuntu and on the GitHub Actions runner node. macOS is fine, and I've just migrated to `clang++` for the purposes of completing this work.
Step 20
Richer telemetry counters in debug mode.
Introduced more executor-level stats. This commit will be self-explanatory.
Step 21
The time is now simulated.
This is another important commit.
Instead of properly sleeping, the coroutine engine we have now can simulate the time going forward. And this is what this commit does.
It also uses the `std::condition_variable` + `std::unique_lock` combo, instead of the busy wait loop, just as the real code should.
Step 22
Added real `sleep()`-s back to the now-simulated time.
Finally, it can be instructed to truly `sleep()` for a certain number of milliseconds per the specified "time unit". But sleeping is no longer mandatory.
Step 23
Made `cout` thread-safe.
One step before the last: since `Ctrl+\` will be printing real-time stats on the running coroutines, we need a `cout` that is thread-safe.
It's a trivial self-contained C++ exercise.
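One common shape for this exercise (my own sketch, not necessarily the repo's approach): buffer each statement's output locally, then emit the whole line under a single global lock in the destructor, so lines from different threads never interleave mid-line.

```cpp
#include <iostream>
#include <mutex>
#include <sstream>
#include <string>

// Sketch of a thread-safe cout: each statement accumulates into a
// local buffer, and the destructor flushes the whole line atomically.
class SafeCout {
 public:
  ~SafeCout() {
    static std::mutex mutex;
    std::lock_guard<std::mutex> lock(mutex);
    std::cout << oss_.str() << '\n';
  }
  template <typename T>
  SafeCout& operator<<(const T& x) {
    oss_ << x;
    return *this;
  }
  std::string Str() const { return oss_.str(); }  // for inspection

 private:
  std::ostringstream oss_;
};

// Usage: SafeCout() << "value: " << 42;  // printed as one atomic line
```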
Step 24
Catching `Ctrl+\` (SIGQUIT) and printing the coroutines executor state.
This step is truly taking it too far, but it's the last one.
Now `Ctrl+\`, the `SIGQUIT` signal, is intercepted, and instead of terminating the application, real-time stats are printed.
I also go a bit too far, on purpose — since this is a demo, not production code! — and introduce some macros for coroutine-related stuff. These macros inject extra data into the executor, so that this `Ctrl+\`-produced output contains human-readable data not only on how many coroutines are currently running and what their memory addresses are, but also a) the names of those "coroutine function calls" that started the respective coroutine frames (fibers), and b) the exact locations these fibers are currently `co_await`-ing on.
Here’s how it looks when pressing `Ctrl+\`:
And for the non-FizzBuzz code:
Outro
This is not production code. In a way, it’s a tutorial I wanted to create. Primarily because I firmly believe the best way to ensure I understand something is to be able to explain it well 😼
One “wrong” thing I did very deliberately though is a lot of `using`-s. It’s absolutely a red flag in real C++ code. But a friend who I trust once argued that a major advantage of Python is that it maintains a healthily high density of important yet readable language constructs per square inch. Since then, when I’m writing code the purpose of which is to illustrate a concept, I adopt the principle of as little gutter as possible.
Also, we did not cover exceptions and `co_yield`. And the code is excessively thread-safe for a single-threaded executor. Alas, simplicity is key.
So long and see you in the next posts!