The Apache Kafka "Hello World"
This is a blog post from our Community Stream: by developers, for developers. Don't forget to stop by our community to find similar articles or join the conversation.
Before we find out why the Kafka rocket can fly at all and how it does so, we ignite the engines together and take a short test flight:
For our flight into space, we launch our (of course reusable) rocket, briefly leave the atmosphere to practice re-entry, and gently and safely land the rocket on a landing pad.
We want to capture all flight phases in Kafka and create a topic first. Topics are similar to tables in databases in that we store a collection of data of a certain kind in a topic. In our case, it is flight data, so we call the topic flightdata:
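A minimal sketch of the command, assuming the Kafka CLI scripts are on the PATH and a broker is running locally on the default port:

```shell
# Create the topic "flightdata" with one partition and no replication.
# Assumes a local Kafka broker listening on localhost:9092.
kafka-topics.sh --create \
  --topic flightdata \
  --partitions 1 \
  --replication-factor 1 \
  --bootstrap-server localhost:9092
```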
We use the kafka-topics.sh command to manage our topics in Kafka. Here, we tell Kafka to create the topic flightdata (--topic flightdata) with the --create argument. First, we start with one partition (--partitions 1) and without replicating the data (--replication-factor 1). Finally, we specify which Kafka cluster kafka-topics.sh should connect to; in our case, this is our local cluster, which by default listens on port 9092 (--bootstrap-server localhost:9092). The command confirms the successful creation of the topic. If we get errors here, it is usually because Kafka has not started yet and is therefore not reachable, or because the topic already exists.
So now we have a place to store our data. The rocket's board computer continuously sends us updates on the rocket's flight state. For our simulation, we use the command line tool kafka-console-producer.sh. This producer, along with other useful tools, ships directly with Kafka. The producer connects to Kafka, takes data from the command line, and sends it as messages to a topic (configurable via the --topic parameter). Let's write the message Countdown started into our newly created topic flightdata:
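One way to send the message, assuming a local broker, is to pipe it into the producer (the producer also accepts lines typed interactively):

```shell
# Pipe a single message into the topic "flightdata".
# Assumes a local Kafka broker on localhost:9092.
echo "Countdown started" | kafka-console-producer.sh \
  --topic flightdata \
  --bootstrap-server localhost:9092
```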
Our ground station now wants to read this data and output it on a large screen so that we can see whether the rocket really works as we expect it to. Let’s take a look at what has happened so far. To read our sent message again, we start the kafka-console-consumer.sh, which is also part of the Kafka family:
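A sketch of the consumer invocation, assuming a local broker, wrapped in timeout so it exits on its own:

```shell
# Read new messages from "flightdata"; timeout ends the consumer
# after at most 10 seconds. Assumes a local broker on localhost:9092.
timeout 10 kafka-console-consumer.sh \
  --topic flightdata \
  --bootstrap-server localhost:9092
```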
When we start the kafka-console-consumer.sh, it continues to run by default until we actively cancel it (for example, with CTRL+C). This would also be the desired behavior if we really wanted to display the current state of the rocket somewhere. In our example, we use the timeout command to make the consumer terminate automatically after 10 seconds at the latest. For the consumer, we have to specify again which topic it should use (--topic flightdata).
Somewhat surprisingly, no message is displayed. This is because, by default, kafka-console-consumer.sh starts reading at the end of the topic and only prints new messages. To also display previously written data, we have to use the flag --from-beginning:
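With the flag added, the invocation might look like this (again assuming a local broker):

```shell
# Read "flightdata" from the start, including older messages.
timeout 10 kafka-console-consumer.sh \
  --topic flightdata \
  --from-beginning \
  --bootstrap-server localhost:9092
```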
This time we see the message Countdown started! So, what happened? We used the kafka-topics.sh command to create the topic flightdata in Kafka and used the kafka-console-producer.sh to produce the message Countdown started. Then, we read this message again with the kafka-console-consumer.sh. We can represent this data flow as follows:
Without any other information, the kafka-console-consumer.sh always starts reading at the end. That means, if we want to read all messages, we have to use the flag --from-beginning.
Interestingly, and unlike in many messaging systems, we can read messages not just once but as many times as we want. We can use this, for example, to connect several independent ground stations to the topic so that they all read the same data. Or there may be different systems that all need the same data: besides the display screen, we can imagine other services, such as one that compares the flight data with current weather data and decides whether anything needs to be done. We may also want to analyze the data after the flight and need the historical flight data for that. To accomplish this, we can simply run the consumer multiple times and get the same result each time.
However, we would now like to display the current state of the rocket in our control center, in such a way that the display updates immediately when there is new data. For this, we start the kafka-console-consumer.sh (without timeout) in another terminal window. As soon as new data is available, the consumer fetches it from Kafka and displays it on the command line:
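The live consumer is the same invocation without the timeout (assuming a local broker):

```shell
# Follow the topic live; runs until interrupted with CTRL+C.
kafka-console-consumer.sh \
  --topic flightdata \
  --bootstrap-server localhost:9092
```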
To simulate the producer on the rocket side, we now start kafka-console-producer.sh. The command does not stop until we press CTRL+D, which sends end-of-file (EOF) to the producer:
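Starting the interactive producer might look like this (assuming a local broker):

```shell
# Start an interactive producer; each typed line becomes one message.
# Exit with CTRL+D. Assumes a local broker on localhost:9092.
kafka-console-producer.sh \
  --topic flightdata \
  --bootstrap-server localhost:9092
```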
The kafka-console-producer.sh sends one message to Kafka per line we write. That means we can now type messages into the terminal with the producer:
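For example, we might type a few (hypothetical) flight-state updates, one per line:

```
Liftoff!
Max-Q reached
Main engine cutoff
```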
We should also see these promptly in the window with the consumer:
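The consumer simply prints each message value on its own line, for example (matching the hypothetical updates typed into the producer):

```
Liftoff!
Max-Q reached
Main engine cutoff
```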
Let’s imagine that part of our ground crew is in the home office and wants to follow the flight from home. To do so, they independently start their consumer. We can simulate this by starting a kafka-console-consumer.sh in another terminal window, which displays all data from the beginning:
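The home-office consumer is simply another independent instance reading from the start (assuming it can reach the broker):

```shell
# A second, independent consumer reading all data from the beginning.
kafka-console-consumer.sh \
  --topic flightdata \
  --from-beginning \
  --bootstrap-server localhost:9092
```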
We can see here that data, once written, can be read in parallel by multiple consumers without the consumers having to talk to each other or register with Kafka first. Kafka does not delete data when it is consumed, which means we can still start a consumer later that reads historical data (as long as the topic's retention settings permit it).
Now, let’s say that we don’t want just one rocket to fly at a time, but several. Kafka has no problem with this and can easily process data from numerous Producers simultaneously. So let’s start another Producer in another terminal window:
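The second producer is started exactly like the first, writing to the same topic (assuming a local broker):

```shell
# A second interactive producer, e.g. for rocket 2, writing to the same topic.
kafka-console-producer.sh \
  --topic flightdata \
  --bootstrap-server localhost:9092
```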
We see all the messages from all the producers show up in all our consumers, in the order the messages were produced:
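A consumer window might then show an interleaving like this (hypothetical messages; note that nothing tells us which rocket sent which line):

```
Countdown started
Liftoff!
Countdown started
Liftoff!
```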
But now the issue arises that we cannot distinguish the messages from rockets 1 and 2. We could include in each message which rocket sent it. But more about that later.
Before we continue, we should abort the flight of our second rocket, as we want to launch it later:
We have now successfully launched and landed a rocket on a test basis. In the process, we wrote some data to Kafka. To accomplish this, we first created a topic flightdata using the command line tool kafka-topics.sh, which holds all the flight data for our rocket. Into this topic, we produced some data using kafka-console-producer.sh; in our case, this was information about the current status of the rocket. We were able to read and display this data using kafka-console-consumer.sh. We even went further, producing data in parallel with multiple producers and reading it simultaneously with multiple consumers. With the --from-beginning flag of kafka-console-consumer.sh, we accessed historical data. Thus, we have already become familiar with three command line tools that ship with Kafka.
After gaining this experience, we can now close all open terminals: producers with CTRL+D and consumers with CTRL+C. This example should not obscure the fact that Kafka is used wherever larger amounts of data are processed. From my training experience, I know that Kafka is used intensively by many car manufacturers, supermarket chains, and logistics service providers, and even in many banks and insurance companies. In this blog post, we got an initial overview of Kafka and ran through a simple test flight scenario. If you want to dig deeper, you can find more details about the Kafka architecture in my German book and in upcoming Xeotek Community articles.