Kafka, Brod, and the Ig Nobel in C++ literature

TL;DR: Redpanda is to Kafka what Scylla is to Cassandra: a high-performance C++ rewrite of an otherwise well-known and popular project and API. Below I describe how I resolved an Elixir Kaffe and Erlang Brod consumer group rebalance issue when Vectorized Redpanda, rather than the original Apache Kafka, was the message broker. The change was sent upstream and successfully merged by the kafka4beam team. Thanks!

As the summary says, Vectorized Redpanda is a high-performance C++ implementation of the Apache Kafka broker, protocol, and the complete producer/consumer model:

A Kafka® API compatible streaming platform for mission-critical workloads.

In one of my most recent projects I had a chance to work with Redpanda, which is still a relatively new piece of technology. Working with new technology can be quite exciting: it certainly is when everything works well, but possibly even more so when it doesn't. In each case, the excitement is just of a slightly different kind ;)

The system was architected as a set of Elixir microservices communicating via a message broker. Some of them were exposed to the public via a CDN, but the majority processed events consumed from the broker and interfaced with two distinct databases. All the components ran on Kubernetes. All in all, communication with the message broker was crucial.

The system was supposed to handle user money, which was one of the arguments for using Kafka / Redpanda, since they're persistent message brokers. After some preliminary research and prototyping of the retry mechanisms available in various Kafka client libraries, we decided to go with Elixir Kaffe. The library does everything we would have to do anyway if we used Brod directly, packages it in a nice Elixir wrapper, yet doesn't shoehorn the underlying model into an oversimplified or cumbersome abstraction.
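
For context, a minimal Kaffe consumer configuration looks roughly like the sketch below (topic, group, and handler names are made up for illustration, not our production setup):

    # config/config.exs (illustrative values)
    config :kaffe,
      consumer: [
        endpoints: [localhost: 9092],           # broker host/port pairs
        topics: ["payments", "notifications"],  # more than one topic, as in our setup
        consumer_group: "billing-consumers",    # consumer group id
        message_handler: MyApp.MessageHandler,  # module implementing Kaffe's handler callback
        worker_allocation_strategy: :worker_per_topic_partition
      ]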

However, it soon became worrying that configuring our services to consume from more than one topic made it practically impossible to process any messages. Service startup took longer than expected, then at most a few messages were consumed, and then the service was disconnected from the broker and tried reconnecting.

We tried switching between Kaffe's :worker_per_partition and :worker_per_topic_partition allocation strategies, but with no luck. By trial and error we confirmed the problem manifested only when the service was configured to consume from more than one topic. Ultimately, some code diving into Kaffe and Brod and the tried and tested techniques of tracing BEAM code (dbg is great even in Elixir!) led me to pinpoint the problem to subscribers being disconnected from the broker. The question that remained unanswered was why.
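
For reference, tracing brod from a running Elixir node boils down to a few :dbg calls; the module picked below (brod_group_coordinator, the part of brod that talks to the group coordinator) is just an example of what can be traced:

    # Attach a remote IEx shell to the running service, then:
    :dbg.tracer()                      # start the default trace message handler
    :dbg.p(:all, :call)                # enable call tracing for all processes
    # trace every function of brod's group coordinator, including return values
    :dbg.tpl(:brod_group_coordinator, :_, [{:_, [], [{:return_trace}]}])
    # ... reproduce the issue and watch the printed calls ...
    :dbg.stop()                        # stop tracing when done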

Fortunately, Redpanda logging can be very granular when the --default-log-level=trace option is passed to rpk.

As you can see, the level of detail is overwhelming, so it felt like looking for a needle in a haystack.

If you’re interested in what exactly is happening in the log above, then here it goes.

After some time it made sense: the consumer group was rebalancing for no apparent reason. Tweaking Redpanda options (e.g. heartbeat_rate, rebalance_delay_ms) and their counterparts on the Brod side didn't have any reasonable effect.
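
On the Brod side, those knobs live in the group config passed when starting a group subscriber. A rough sketch of where they sit when using brod directly (client name, group, topics, and values are all illustrative):

    # Assumes a brod client named :my_brod_client was started beforehand
    # with :brod.start_client/3; all names and values are illustrative.
    {:ok, _pid} =
      :brod.start_link_group_subscriber(
        :my_brod_client,
        "billing-consumers",            # consumer group id
        ["payments", "notifications"],  # topics
        [                               # group config: the knobs we tweaked
          session_timeout_seconds: 30,
          heartbeat_rate_seconds: 5
        ],
        [begin_offset: :earliest],      # consumer config
        MyApp.GroupSubscriber,          # callback module (brod_group_subscriber behaviour)
        []                              # callback init argument
      )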

Having scratched my head for a while, I retried with Redpanda v21.7.3. Some log messages had different wording or format, but in general the same scenario played out. It became clear I wasn't getting any further with this line of investigation, so the next step was to search the Redpanda broker source for relevant messages like Scheduling initial debounce join timer for 5000 ms and Join completion scheduled in .... Who would've expected that dabbling in C++ in high school would pay off 15 years later?

It turned out that the rebalance timeout computation returned -1, which did not seem to be a reasonable timeout value, especially given there's a clause which should've raised an exception if this timeout couldn't be computed. The -1 seemed to be the rebalance_timeout field of the existing group_members, but I wasn't sure how it got there. It came from a join_group_request class, which in turn got it from a join_group_request_data struct, but the latter did not seem to be defined in the repo. This led me to check whether our client actually sent the rebalance_timeout value in the join_group request - after all, it might've been an issue on our end.

I worked on this further and my research showed that Brod didn't send the rebalance_timeout_ms field in the join_group request. Hacking the lib to do that allowed the group to stabilise, and a corresponding PR was accepted by the kafka4beam team. To be precise, the library used to default to request version 0, which did not yet define rebalance_timeout_ms. So technically speaking the library was not buggy - it just tried to use the lowest supported protocol version in the hope that it would work. In conjunction with the broker-side timeout default of -1, this led to an indefinite consumer group rebalance cycle. @zmstone, one of the Brod maintainers, pointed out that newer Kafka versions fall back to using the session timeout as the rebalance timeout if the latter is not available - I forwarded that to the Vectorized team, so hopefully it will be incorporated in some future version of Redpanda. Apart from this one glitch, we had no issues with Redpanda or Kaffe/Brod whatsoever.
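
Conceptually, the fallback @zmstone described boils down to the hypothetical helper below; it is not Redpanda's or Kafka's actual code, just an illustration of picking an effective rebalance timeout when the client didn't send one:

    # Hypothetical helper, not broker code: choose the effective rebalance timeout.
    defmodule RebalanceTimeout do
      # A v0 join_group request carries no rebalance_timeout_ms, so the broker
      # ends up with a missing value (Redpanda defaulted it to -1); newer Kafka
      # versions fall back to the session timeout instead.
      def effective(rebalance_timeout_ms, _session_timeout_ms)
          when is_integer(rebalance_timeout_ms) and rebalance_timeout_ms > 0 do
        rebalance_timeout_ms
      end

      def effective(_missing_or_negative, session_timeout_ms) do
        session_timeout_ms
      end
    end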

To sum up:

- Brod used to default to join_group request version 0, which does not carry the rebalance_timeout_ms field.
- Redpanda defaulted the missing rebalance timeout to -1 and did not fall back to the session timeout, which kept the consumer group rebalancing indefinitely.
- Teaching Brod to send rebalance_timeout_ms stabilised the group, and the change was merged upstream by the kafka4beam team; a broker-side fallback to the session timeout was suggested to the Vectorized team as well.

Thanks for your time. Stay safe!
