Monitoring and Adjusting Your Apache Kafka Topic Partitions

How to check your pipeline performance and adjust the topic partitions accordingly

Kafka partitions enable parallel processing of your streaming data. Having the right number of partitions ensures that your Kafka consumers are processing as much data as they can.

As a rule of thumb, the number of partitions depends on several factors, such as the expected data throughput and the processing time, i.e. the speed at which each consumer processes records. The number of consumers also determines the minimum number of partitions a topic should have.
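As a back-of-the-envelope calculation, the two factors combine like this. The throughput figures below are hypothetical and stand in for measurements from your own pipeline:

```python
import math

def min_partitions(target_throughput_mb_s: float,
                   per_consumer_throughput_mb_s: float,
                   min_consumers: int = 1) -> int:
    """Estimate the minimum partition count for a topic.

    Each consumer reads from its assigned partitions only, so we need
    at least target/per-consumer partitions to keep up, and at least
    one partition per consumer in the group.
    """
    needed = math.ceil(target_throughput_mb_s / per_consumer_throughput_mb_s)
    return max(needed, min_consumers)

# Hypothetical: 50 MB/s expected, each consumer handles ~8 MB/s
print(min_partitions(50, 8))  # -> 7
```

In practice you would round up further to leave headroom for growth, since partitions can only be increased later, not decreased.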

In a recent post, we provided some helpful advice on calculating the number of Kafka partitions.

However, it is often not known in advance exactly what data volumes will be generated in production, and external circumstances may further influence the pipeline and increase processing time.

Therefore, it is important to regularly check the pipeline performance and adjust topic partitions if necessary. In this post we will demonstrate how to check your pipeline performance and adjust the topic partitions accordingly.

How to tell if you need to increase your topic partitions

"If you're not moving forward, you're falling back." Topic partitions in Kafka can only be increased, not decreased. Therefore, we recommend starting with a lower number of partitions and increasing them successively if you see performance problems in your pipeline.

The easiest way to detect problems is by looking at the consumer lag. If a consumer group is too slow, it will fall behind. The difference between the most recent data set in a partition and the last processed data set of a consumer is called "consumer lag".

If a consumer or an entire consumer group processes data more slowly than it is generated, i.e. the consumer lag increases, it falls further and further behind. Not only does this delay the reading applications, so that users wait longer and longer for up-to-date data, but there is also the risk that, after a certain point, data is deleted from Kafka before the consumer group has had a chance to read it. How long data is kept is controlled by the topic's retention configuration.
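The lag calculation itself is simple arithmetic: for each partition, the difference between the log end offset and the group's last committed offset. A minimal sketch, with made-up offsets for a four-partition topic:

```python
def consumer_lag(end_offsets: dict, committed_offsets: dict) -> dict:
    """Per-partition lag: log end offset minus last committed offset."""
    return {p: end_offsets[p] - committed_offsets.get(p, 0)
            for p in end_offsets}

# Hypothetical offsets for a four-partition topic
end = {0: 1200, 1: 980, 2: 1105, 3: 1500}
committed = {0: 1150, 1: 980, 2: 900, 3: 400}

lag = consumer_lag(end, committed)
print(lag)                 # {0: 50, 1: 0, 2: 205, 3: 1100}
print(max(lag.values()))   # max lag: 1100
print(sum(lag.values()))   # total lag: 1355
```

A snapshot of these numbers tells you little on its own; it is the trend that matters. Lag that grows steadily over time means the group cannot keep up with the producers.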

Viewing Kafka Consumer Lag metrics

On Kadeck's Consumer page shown above (use it for free), you see a consumer group "Risk_Engine" with one consumer. The max group and total group lag numbers are displayed in red because they are particularly high. You can configure the threshold for the color coding individually.

If we look closer, we see that the topic provides four partitions, but only a single consumer consumes from all four partitions. In such a case, we start by adding another consumer to see if this solves our performance problem.

If the lag still increases, we continue by adding two more consumers for a total of four, so that the partitions are evenly distributed. An odd number would cause one consumer to own more partitions, and thus more workload, than the others. We definitely want to avoid such a setup.

If these measures do not bring the desired result, we should think about increasing the number of partitions of the topic so that we can add more consumers. Since a partition is the smallest unit of distribution, adding consumers without adding partitions has no effect: the number of partitions in a topic caps the number of active consumers in a consumer group for that topic. Additional consumers would simply sit idle and never be assigned a partition, because all partitions have already been assigned to other consumers in the group.
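The effect is easy to see with a toy assignment function. The round-robin below is a simplified illustration, not one of Kafka's actual assignors (range, round-robin, sticky), but the cap behaves the same way:

```python
def assign(partitions: list, consumers: list) -> dict:
    """Round-robin partition assignment; surplus consumers stay idle."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# Four partitions, six consumers: c5 and c6 are never assigned anything.
result = assign([0, 1, 2, 3], ["c1", "c2", "c3", "c4", "c5", "c6"])
print(result)
# {'c1': [0], 'c2': [1], 'c3': [2], 'c4': [3], 'c5': [], 'c6': []}
```

With six consumers but only four partitions, two consumers receive an empty assignment and contribute nothing until the partition count grows.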

Partitions can be easily added in Kadeck via the Topic Configuration view, which can be accessed from the Data Catalog.

How to change Kafka partition count

It is recommended to configure the new number of partitions as a multiple of the consumers, so that the partitions are distributed evenly among the consumers.
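If you prefer the command line over Kadeck's UI, the stock Kafka tooling can do the same. The topic name and broker address below are placeholders for your own setup:

```shell
# Increase the partition count of an existing topic to 8.
# Partitions can only be increased, never decreased.
kafka-topics.sh --bootstrap-server localhost:9092 \
  --alter --topic risk-events --partitions 8

# Verify the new partition count.
kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --topic risk-events
```

Note that the command fails if the requested count is lower than the current one.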

What happens if you change your partition count?

Before you increase the number of partitions, you should be aware of the implications. In the following, we will briefly discuss what happens when you increase partitions and how this can affect your Kafka system.

First and foremost, you need to make sure you understand how this will affect your keying strategy. If your messages are keyed, it is important to understand how the keys are distributed across the partitions, as the new partitioning may result in a different assignment of keys to partitions, potentially affecting the message ordering guarantee.
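The sketch below illustrates why: the default partitioner maps a key to a partition by hashing it modulo the partition count (Kafka actually uses murmur2; CRC-32 is used here purely for illustration), so changing the count can move keys to different partitions:

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition by hashing it modulo the partition
    count (CRC-32 here; Kafka's default partitioner uses murmur2)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

keys = ["order-1", "order-2", "order-3", "order-4"]
before = {k: partition_for(k, 4) for k in keys}  # 4 partitions
after = {k: partition_for(k, 6) for k in keys}   # grown to 6

# Keys whose partition changed after growing the topic
print([k for k in keys if before[k] != after[k]])
```

Messages for a moved key now land in a new partition while older messages for the same key remain in the old one, so per-key ordering across the change is no longer guaranteed.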

Adding partitions to a topic triggers a rebalance of the consumer groups subscribed to it. Existing data is not moved to the new partitions, but consumers briefly pause consumption while partitions are reassigned, so it is wise to make the change during a maintenance window or when the system is under less load.

Your cluster will also require more resources, as more partitions mean more open file handles and threads for the broker. You need to make sure that your Kafka broker has enough resources to handle the additional partitions.

Because partitions are replicated across multiple brokers in your cluster, both recovery time and latency may increase: when a broker shuts down cleanly, leaders are proactively moved with minimal unavailability; however, unclean shutdowns can cause significant delays proportional to the number of partitions. If the failed broker is the controller, additional delays may occur as the new controller reads metadata from Zookeeper (or the other brokers). To maintain availability during rare unclean failures, it may be wise to limit the number of partitions per broker and the total number of partitions in the cluster.

More partitions also means more replication work: each partition is replicated individually, so brokers handle more replication requests, which can increase end-to-end latency.

Conclusion

Optimizing Kafka topic partitions is not a set-it-and-forget-it operation. It requires ongoing monitoring and occasional adjustments to handle evolving data volumes and processing requirements.

Starting with a conservative number of topic partitions and carefully increasing them as needed, while considering the various factors outlined above, allows you to maintain an efficient and reliable Kafka streaming pipeline.

By leveraging Kadeck and applying best practices, you can ensure that your Kafka deployment is scalable, resilient, and capable of handling your streaming data needs.

Get the free-forever Kadeck GUI tooling for Apache Kafka, Amazon Kinesis, and Redpanda.

Software architect and engineer with a passion for data streaming and distributed systems.