Tales of Kafka at Cloudflare

At QCon London, Andrea Medda, Senior Systems Engineer at Cloudflare, and Matt Boyle, Engineering Manager at Cloudflare, shared the lessons their platform services team learned from enabling the use of Apache Kafka at the scale of 1 trillion messages.

Matt began by outlining the problems that Cloudflare needs its technology to solve, namely providing its own private and public cloud, and the operational challenge of coupling between teams that arose as its business needs grew and evolved. He went on to explain how Apache Kafka was selected as the company's implementation of the message bus pattern.

While the message bus pattern enabled the decoupling of load between microservices, Matt explained how services still ended up being tightly coupled because of an unstructured approach to schema management. To solve this problem, they opted to migrate from JSON messages to Protobuf and to build a client-side library to validate messages prior to publishing them.
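The idea of validating on the client before publishing can be illustrated with a minimal sketch. This is not Cloudflare's library: `ValidatingProducer`, `SchemaError`, the `UserCreated` type and the topic name are all hypothetical, a plain dataclass stands in for a Protobuf-generated type, and JSON stands in for Protobuf serialization so that the example needs only the standard library. It also assumes one registered message type per topic.

```python
import json
from dataclasses import asdict, dataclass, fields
from typing import Callable, Dict, Type


@dataclass(frozen=True)
class UserCreated:
    """Hypothetical message type; stands in for a Protobuf-generated class."""
    user_id: int
    email: str


class SchemaError(ValueError):
    """Raised when a message does not match the type registered for its topic."""


class ValidatingProducer:
    """Hypothetical wrapper that rejects invalid messages before they are sent."""

    def __init__(self, send: Callable[[str, bytes], None]) -> None:
        # `send` is whatever actually delivers bytes to the broker (assumption).
        self._send = send
        self._schemas: Dict[str, Type] = {}

    def register(self, topic: str, schema: Type) -> None:
        # Assumption for this sketch: exactly one message type per topic.
        self._schemas[topic] = schema

    def publish(self, topic: str, message: object) -> None:
        schema = self._schemas.get(topic)
        if schema is None:
            raise SchemaError(f"no schema registered for topic {topic!r}")
        if type(message) is not schema:
            raise SchemaError(
                f"topic {topic!r} expects {schema.__name__}, "
                f"got {type(message).__name__}"
            )
        # Check every field against its declared type before sending anything.
        for field in fields(message):
            value = getattr(message, field.name)
            if not isinstance(value, field.type):
                raise SchemaError(
                    f"field {field.name!r} must be {field.type.__name__}"
                )
        # JSON is only a stand-in here; a real client would serialize Protobuf.
        self._send(topic, json.dumps(asdict(message)).encode("utf-8"))
```

A producer built this way fails fast in the publishing service, so a malformed message never reaches the topic where consumers owned by other teams would have to cope with it.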

As the adoption of Apache Kafka grew across their teams, they developed a Connector Framework to make it easier for teams to stream data between Apache Kafka and other systems while transforming the messages in the process.
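The general shape of such a framework, reading from a source, applying transformations and writing to a sink, can be sketched as follows. The talk summary does not describe the framework's actual interface, so `run_connector`, `drop_debug` and `add_source` are hypothetical names, messages are modelled as plain dictionaries, and an in-memory list stands in for a Kafka topic.

```python
from typing import Callable, Iterable, List, Optional

Message = dict
# A transform returns the (possibly modified) message, or None to drop it.
Transform = Callable[[Message], Optional[Message]]


def run_connector(
    source: Iterable[Message],
    transforms: List[Transform],
    sink: Callable[[Message], None],
) -> int:
    """Stream messages from source to sink, applying transforms in order.

    Returns the number of messages delivered to the sink.
    """
    delivered = 0
    for message in source:
        current: Optional[Message] = message
        for transform in transforms:
            current = transform(current)
            if current is None:
                break  # a transform filtered this message out
        if current is not None:
            sink(current)
            delivered += 1
    return delivered


def drop_debug(message: Message) -> Optional[Message]:
    """Example transform: filter out messages flagged as debug."""
    return None if message.get("debug") else message


def add_source(message: Message) -> Optional[Message]:
    """Example transform: tag each message with where it came from."""
    return {**message, "source": "kafka"}
```

Keeping the source, transforms and sink as separate pluggable pieces is what lets a team wire up a new pipeline by configuration rather than by writing a new service each time.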

During the pandemic, as load on Cloudflare’s systems grew, the team began to observe bottlenecks on a key consumer which had begun to breach its Service Level Agreements. Andrea explained how the team’s initial struggle to identify the root cause of the issue prompted them to enrich their software development kits (SDKs) with tooling from the OpenTelemetry ecosystem to gain better visibility of interactions across their stack.

Andrea went on to highlight how the success of their SDKs brought more internal users, which spurred a need for better support in the form of documentation and ChatOps.

Andrea summarized the key lessons as:

  • Striking the balance between highly configurable and simple standardized approaches when providing developer tooling for Apache Kafka
  • Opting for a simple and strict 1:1 contract interface to ensure maximum visibility into the workings of topics and their usage
  • Investing in metrics on development tooling to allow problems to be easily surfaced
  • Prioritizing clear documentation on patterns for application developers to enable consistency in adoption and use of Apache Kafka

Finally, Matt shared a new internal product, called Gaia, that the team was building to enable push-button creation of services according to Cloudflare’s best practices.