Kafka, what is it?

A quick explanation of Kafka for beginners with examples. Why use it? How does it work?

Find more tutorials explaining the basics of a subject in ten minutes with the tag "what is it".

Apache Kafka is a technology used to exchange data between pieces of software.

There are 3 main uses of Kafka:

  • As a message queue system, asynchronously exchanging events and the related data between services. It's a core concept of a microservices infrastructure, improving the separation of concerns of each service. Don't worry if you didn't fully understand this sentence, you will in 2 minutes.
    The following tutorial will mainly be oriented around this use case.
  • As an ETL (Extract-Transform-Load), connecting data sources together and allowing the data to be transformed along the way. Kafka is mainly used when real-time analysis is needed: data is injected into Kafka, transformed in real time, then loaded directly into a database.
  • As a source of truth. Kafka is a distributed and reliable system. It can be used to store critical data: both the current state of this data and the different events leading to this current state. This is the concept of event sourcing: reconstructing the current state from the record of every step.
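To make the event-sourcing idea concrete, here is a minimal Python sketch (the event names are invented for illustration): the current state is never stored directly, it is recomputed by replaying every recorded event in order.

```python
# Minimal event-sourcing sketch: the current state is rebuilt
# by replaying every recorded event, in order.

def apply(state, event):
    """Apply a single event to the current state."""
    if event["type"] == "subscription_enabled":
        return {**state, "active": True}
    if event["type"] == "subscription_disabled":
        return {**state, "active": False}
    return state  # unknown events are ignored

def rebuild_state(events):
    """Replay the whole log to compute the final state."""
    state = {"active": False}
    for event in events:
        state = apply(state, event)
    return state

log = [
    {"type": "subscription_enabled"},
    {"type": "subscription_disabled"},
    {"type": "subscription_enabled"},
]
print(rebuild_state(log))  # {'active': True}
```

Notice that nothing is ever deleted from the log: re-enabling a subscription is just one more event.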

This is a theoretical tutorial. Learn how to put Kafka into practice in the next tutorials.


A Kafka message is a piece of information that needs to be shared across several systems. Some of these systems write a message and others read it.


John ordered a burger.

The best part here is that the system writing the message doesn't need to know who reads it. As a message queue system, Kafka helps to split a system into small independent pieces, each one providing a service in full autonomy.


  • The service taking orders writes the message "John ordered a burger".
  • The service delivering the orders reads the message and processes it to deliver the burger.
  • The data analytics service reads the message and updates its data in real time.

You may have noticed that all the messages in the examples are formulated as "events". This is not a constraint imposed by Kafka; it's a common practice in microservices architectures. Always formulate your messages as events!


You will not say "John wants mustard instead of ketchup in his burger" but rather "There is an update on John's order: the new sauce is mustard".

Once a message is in Kafka, it cannot be altered. If you want to invalidate a message, you must send a second message invalidating the first one.


A customer disabled his subscription. If he re-enables it, you cannot delete the message "subscription disabled"; instead you need to send a new message, "subscription enabled".

Because anyone can read your messages, the format needs to stay consistent over time: you cannot delete or transform a piece of information, since that may affect the readers. Adding extra info is allowed, because no reader can be impacted by a field it ignores.


"John shared the tutorial 'Kafka, what is it?' on Twitter". We cannot remove any information from this message and later send just "Jane shared the tutorial 'Kafka, what is it?'": the missing "Twitter" could have an impact on some consumers. However, "Jane shared the tutorial 'Kafka, what is it?' on Twitter at 8 PM" is allowed.
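One way to stay compatible over time, sketched in Python with invented field names: readers only rely on the fields they know, so producers can later add new fields (like a timestamp) without breaking anyone.

```python
# A consumer that only reads the fields it knows about:
# producers may add new fields later without breaking it.

def handle_share(message):
    user = message["user"]          # present since day one
    platform = message["platform"]  # present since day one
    return f"{user} shared on {platform}"

old_message = {"user": "John", "platform": "twitter"}
new_message = {"user": "Jane", "platform": "twitter", "shared_at": "20:00"}

# Both versions of the message are handled the same way.
print(handle_share(old_message))  # John shared on twitter
print(handle_share(new_message))  # Jane shared on twitter
```

Removing the "platform" field, on the other hand, would make this consumer crash: that's exactly the kind of breaking change to avoid.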

Since both represent some kind of event, it's important for the safety of your software not to confuse "log entries" and "messages". Log entries are pieces of info used to monitor and debug a platform; they don't have any direct business impact. Messages, on the other hand, are used to provide useful data with a business impact to a system.


  • "The user John Doe completed the journey in order to buy the last iPhone". This event is meant to be used by other systems, like the one delivering the iPhone to John: it's a message.
  • "The SQL server processed the query in 5 seconds". While this event can be used by monitoring software, we mainly consider it a log entry.

As you can see, the line between log entries and messages can be thin. It's up to you to identify what goes in which box and what to send through Kafka.


A Kafka topic represents the main subject, the category, of a group of messages.


The message "the new user John Doe just registered" will be published in the topic "a new user just registered".

Physically, a topic is nothing more than a set of commit logs, and the messages are the lines / transactions / commits of these logs. You can visualize topics as escalators carrying people (messages) in real time. On an escalator, it is forbidden to overtake someone: people follow a FIFO (First In, First Out) order.
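A commit log is simple enough to sketch in a few lines of Python (an in-memory toy, not real Kafka): messages are only ever appended at the end, and reads come back in the same order.

```python
# Toy commit log: messages are only ever appended,
# and are read back in FIFO order.

class CommitLog:
    def __init__(self):
        self.entries = []

    def append(self, message):
        """Append a message and return its position (its offset)."""
        self.entries.append(message)
        return len(self.entries) - 1

    def read_from(self, offset):
        """Read every message from a given position onwards."""
        return self.entries[offset:]

log = CommitLog()
log.append("John ordered a burger")
log.append("There is an update on John's order: the new sauce is mustard")
print(log.read_from(0))
```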

Note that "real time" here means "continuously". In fact, there can be a delay of several minutes, even several hours in the worst-case scenario, between the writing of a message and its reception by the readers. For a truly quasi-synchronous system, Kafka is maybe not the best solution, since there is no guarantee on the delivery time of the messages: you should consider a good old-fashioned HTTP API instead.

Producers and consumers

The producer is the system publishing (writing, producing) messages into topics. For most programming languages, you can find a library serving as an abstraction layer: producing a message can often be done in less than 10 lines of code.

Technically, Kafka uses the term "producer record" to designate what a producer produces. The record contains technical fields such as the topic, a timestamp, a key, a value (aka the message content) and so on.
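The shape of a producer record can be sketched as a plain Python dataclass (a simplification for illustration, not the real client-library class):

```python
from dataclasses import dataclass
from typing import Optional
import time

@dataclass
class ProducerRecord:
    """Simplified view of what a producer sends to Kafka."""
    topic: str                       # which topic to publish into
    value: bytes                     # the message content
    key: Optional[bytes] = None      # used by partitioning strategies
    partition: Optional[int] = None  # can be forced explicitly
    timestamp_ms: Optional[int] = None

record = ProducerRecord(
    topic="orders",
    value=b"John ordered a burger",
    key=b"john",
    timestamp_ms=int(time.time() * 1000),
)
print(record.topic, record.key)
```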

The consumer is the system reading the messages. It's basically an infinite loop subscribing to topics, waiting for new messages to consume.
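The consumer side is conceptually a poll loop. Here is a toy version over an in-memory queue (a real Kafka client polls the broker instead, and never stops):

```python
from collections import deque

# Toy consumer loop: poll for new messages and process them.
# A real Kafka consumer would loop forever, polling the broker.

queue = deque(["John ordered a burger", "Jane ordered fries"])
processed = []

while queue:                   # a real consumer uses `while True`
    message = queue.popleft()  # in real life: poll the broker
    processed.append(f"handled: {message}")

print(processed)
```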

Brokers and clusters

A broker is a software process, a daemon running on a physical or virtual machine: basically, it's the Kafka server. The producers send messages to the brokers, the brokers sort them into topics, and the consumers retrieve them.

A cluster is basically a group of brokers.


Apache Zookeeper is a technology used to build distributed systems. Lots of software, including Kafka, rely on it.

We just saw that the real worker, the one doing the job, is the "broker". There can be one or several brokers in a group called a "cluster". Saying that the cluster is a "distributed system" means that:

  • The brokers can communicate with each other to share some configuration, the status of the cluster, who is responsible for which task, and so on.
  • A broker is elected as "controller" to coordinate the nodes of the cluster: dispatching the work, monitoring the cluster's health, and so on.
  • It has a highly scalable architecture in which thousands of brokers can be plugged.
  • It is fault tolerant: data is replicated so that the failure of a broker does not cause data loss. Remember that a topic is just a bunch of commit logs: they are easy to replicate, and all the commits / transactions can be replayed to compute the final state.

    The controller delegates this task by promoting some brokers to "leaders". A leader takes a task and manages its replication according to a special configuration: the "replication factor".

    The non-leader nodes are called "followers". Together, the leader and its followers form a quorum. If a leader cannot constitute a quorum, the task is re-assigned to another leader.


A topic is composed of one or several commit logs: each log is called a "partition".

Earlier, you visualized the topic as an escalator that people ride. Actually, this escalator is not the topic but rather a partition of the topic: you can now visualize the topic as a set of escalators. Staying with the analogy, imagine a single little escalator at the exit of Madison Square Garden: it will definitely not handle the crowd. In this case, we need several escalators serving the same goal.

OK, but in this situation, can a person queuing for escalator 2 exit before a person queuing for escalator 1, even if they arrived later?

Yes indeed: if you absolutely need the messages to be ordered in time, having several partitions can be tricky. The order is only guaranteed within a single partition, so one common approach is to route related messages to the same partition using a key; across partitions, the ordering must be resolved on the consumer side, Kafka cannot help you there.

Speaking of consumers, I told you before that a consumer subscribes to whole topics: it's true, but if the load is too heavy, it can also be directly assigned to a specific partition of a topic.

Zookeeper assigns the partitions to leaders and broadcasts who is responsible for what. When a producer wants to publish a message, it first retrieves this metadata, then determines which broker will manage its "producer record" according to a partitioning strategy: the producer can directly indicate a partition in the record, a modulo can be applied to the provided key, and so on.
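A common partitioning strategy is a hash of the key modulo the number of partitions, so all messages with the same key land in the same partition. A sketch (the real Kafka default uses a different hash function, but the idea is the same):

```python
import zlib
from typing import Optional

def pick_partition(key: bytes, num_partitions: int,
                   forced: Optional[int] = None) -> int:
    """Pick a partition for a record, like a producer's partitioner would."""
    if forced is not None:  # the producer explicitly chose a partition
        return forced
    # Stable hash of the key, modulo the partition count.
    return zlib.crc32(key) % num_partitions

# The same key always lands in the same partition:
p1 = pick_partition(b"john", 6)
p2 = pick_partition(b"john", 6)
assert p1 == p2
print(p1)
```

This is why keying messages by, say, order ID keeps all the events of one order in order: they all go through the same partition.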

Because it's just a set of files on disk, a partition cannot be split across several brokers: a partition is located on only one broker. This means that a partition with a large amount of messages can overload not only the inputs/outputs and the CPU of the broker, but also its disk space.


In order to not consume the same messages twice, each consumer keeps its own "consumer offset" per partition. An offset is the position of a message within the partition. Committing the offset is a task executed by the consumer: if the consumer doesn't commit the offset each time it processes a message, there can be a difference between what Kafka knows, the "last committed offset", and the "current position" of the consumer. Note that the "auto-commit" configuration option is enabled by default, but it can be really messy, so I would not recommend using it: it could commit a message that you failed to process, for instance.
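The gap between the "current position" and the "last committed offset" can be illustrated with an in-memory toy (the failing message is invented for the example):

```python
# Toy illustration of "current position" vs "last committed offset".

def process(message):
    """Pretend that processing msg-2 fails."""
    return message != "msg-2"

partition = ["msg-0", "msg-1", "msg-2", "msg-3"]
committed_offset = -1  # the last offset Kafka knows we processed
position = 0           # where the consumer actually is in the partition

for message in partition:
    position += 1          # reading moves the current position...
    if not process(message):
        break              # ...but a failed message is not committed
    committed_offset = position - 1

# The consumer read up to position 3 but only committed offset 1:
# on restart, it resumes at offset 2 and retries "msg-2".
print(position, committed_offset)
```

This is exactly why manual commits are safer than auto-commit: a crash or a failure never loses the message, it just gets re-read.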

For a new consumer, the "from-beginning" directive, as its name suggests, reads all the messages of the topic from the beginning.

Fun fact, the offsets are stored by Kafka in a special topic: __consumer_offsets.

Consumer groups

A consumer can subscribe to whole topics, but the burden can be too heavy. You can explicitly assign a consumer to a single partition, but this makes your system hard to maintain: a new partition could be added, for instance.

A consumer group is a set of independent consumers sharing some info, like the offsets, and performing the same processing on the same targets. When subscribing to topics, a consumer can join a group by providing a group ID. Zookeeper promotes a broker as "group coordinator"; it manages the group members and dispatches the load by assigning consumers to partitions. Note that, within a group, a partition can only have one consumer, but a consumer can be assigned to several partitions. If a consumer dies, or if a new consumer or a new partition is created, the coordinator triggers a "rebalancing" protocol.
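The assignment within a group can be sketched as a round-robin (one of several possible strategies): each partition gets exactly one consumer of the group, while a consumer may get several partitions.

```python
def assign(partitions, consumers):
    """Round-robin assignment: one consumer per partition,
    possibly several partitions per consumer."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        consumer = consumers[i % len(consumers)]
        assignment[consumer].append(partition)
    return assignment

# 4 partitions shared across 2 consumers of the same group:
print(assign([0, 1, 2, 3], ["consumer-a", "consumer-b"]))
# If a consumer joins or leaves, recompute the whole mapping:
# conceptually, that's what a "rebalancing" does.
print(assign([0, 1, 2, 3], ["consumer-a", "consumer-b", "consumer-c"]))
```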

Miscellaneous info

  • The producer records do not go directly to Kafka: they are first wrapped in an object containing several records, called a "record batch". There is one batch per topic / partition pair. The batches are stored in a buffer called the "record accumulator". This batching concept can also be found in the consumers and some other places in the Kafka engine, reducing the network and disk workload.
  • There are several acknowledgement modes guaranteeing to a producer that the cluster received its message: acknowledgement by the leader, by the whole quorum, no acknowledgement at all, and so on.
  • By default, the messages are retained in a topic for 168 hours / 7 days. After this period, the messages are deleted: this retention is configurable per topic.

Learn more about Kafka on the official site.

Have a nice day :)