Setting up an Apache Kafka cluster for a web application
Kafka is not just a message queue: it is a distributed log with ordering guarantees, replication, and the ability to replay events from any point. For a web application this means asynchronous event processing, service decoupling, audit logs, event sourcing, and real-time analytics.
A single broker is suitable only for development; a production cluster needs at least 3 brokers with replication.
Choosing mode: KRaft vs ZooKeeper
KRaft mode (which removes the ZooKeeper dependency) has been production-ready since Kafka 3.3 and is the recommended approach. For new deployments, use KRaft only.
3-node cluster in KRaft mode:
- kafka-1: controller + broker
- kafka-2: controller + broker
- kafka-3: controller + broker
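The brokers find each other by hostname, so either DNS or /etc/hosts must resolve the kafka-N names on every node. An /etc/hosts sketch (addresses illustrative):

```
10.0.0.11 kafka-1 kafka-1.internal
10.0.0.12 kafka-2 kafka-2.internal
10.0.0.13 kafka-3 kafka-3.internal
```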
Installation on Ubuntu 22.04
# Java is required
apt install -y openjdk-21-jdk-headless
# Download Kafka
KAFKA_VERSION=3.7.0
SCALA_VERSION=2.13
wget https://downloads.apache.org/kafka/${KAFKA_VERSION}/kafka_${SCALA_VERSION}-${KAFKA_VERSION}.tgz
tar -xzf kafka_${SCALA_VERSION}-${KAFKA_VERSION}.tgz -C /opt/
ln -s /opt/kafka_${SCALA_VERSION}-${KAFKA_VERSION} /opt/kafka
useradd -r -s /bin/false kafka
# chown the real directory, not the symlink (chown -R on a symlink may not traverse it)
chown -R kafka:kafka /opt/kafka_${SCALA_VERSION}-${KAFKA_VERSION}
mkdir -p /var/log/kafka /data/kafka
chown kafka:kafka /var/log/kafka /data/kafka
KRaft configuration (on each node)
/opt/kafka/config/kraft/server.properties — different for each node:
# Node 1 (change node.id and advertised.listeners for nodes 2 and 3)
node.id=1
process.roles=broker,controller
controller.quorum.voters=1@kafka-1:9093,2@kafka-2:9093,3@kafka-3:9093
listeners=PLAINTEXT://0.0.0.0:9092,CONTROLLER://0.0.0.0:9093
advertised.listeners=PLAINTEXT://kafka-1.internal:9092
inter.broker.listener.name=PLAINTEXT
controller.listener.names=CONTROLLER
listener.security.protocol.map=PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT
# Storage
log.dirs=/data/kafka
num.recovery.threads.per.data.dir=4
# Performance
num.io.threads=16
num.network.threads=8
socket.send.buffer.bytes=1048576
socket.receive.buffer.bytes=1048576
socket.request.max.bytes=104857600
# Replication
default.replication.factor=3
min.insync.replicas=2
num.partitions=6
offsets.topic.replication.factor=3
transaction.state.log.replication.factor=3
transaction.state.log.min.isr=2
# Retention
log.retention.hours=168
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000
# Compression
compression.type=lz4
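The retention settings above drive disk sizing. A back-of-envelope sketch, assuming a hypothetical 20 MB/s of producer traffic across the cluster:

```shell
#!/usr/bin/env bash
# Disk needed = throughput * retention window * replication factor.
# 20 MB/s is an assumption for illustration; the other values match the config above.
THROUGHPUT_MB_S=20
RETENTION_H=168      # log.retention.hours
REPLICATION=3        # default.replication.factor
BROKERS=3
TOTAL_GB=$(( THROUGHPUT_MB_S * 3600 * RETENTION_H * REPLICATION / 1024 ))
PER_BROKER_GB=$(( TOTAL_GB / BROKERS ))
echo "cluster: ${TOTAL_GB} GB, per broker: ${PER_BROKER_GB} GB"
```

Roughly 35 TB across the cluster at that rate, so size /data/kafka accordingly, and leave headroom: segments are deleted only after they roll and the retention check passes.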
Storage initialization (one time):
# Generate cluster UUID (once, same for all nodes)
CLUSTER_UUID=$(kafka-storage.sh random-uuid)
# Format storage on each node
kafka-storage.sh format \
-t $CLUSTER_UUID \
-c /opt/kafka/config/kraft/server.properties
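If formatting succeeded, each log directory contains a meta.properties tying the node to the cluster; roughly (IDs illustrative):

```properties
# /data/kafka/meta.properties
version=1
node.id=1
cluster.id=<generated UUID>
```

A mismatch between this file and server.properties (for example after cloning a VM) prevents the broker from starting.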
Systemd unit
[Unit]
Description=Apache Kafka
After=network.target
[Service]
Type=simple
User=kafka
Environment="KAFKA_HEAP_OPTS=-Xmx4g -Xms4g"
Environment="KAFKA_JVM_PERFORMANCE_OPTS=-server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -Djava.awt.headless=true"
ExecStart=/opt/kafka/bin/kafka-server-start.sh /opt/kafka/config/kraft/server.properties
ExecStop=/opt/kafka/bin/kafka-server-stop.sh
Restart=on-failure
RestartSec=5
LimitNOFILE=65536
[Install]
WantedBy=multi-user.target
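If you plan to scrape JMX metrics (the monitoring section below assumes JMX on port 9999), a systemd drop-in can expose it without editing the main unit; a sketch:

```ini
# /etc/systemd/system/kafka.service.d/jmx.conf
[Service]
Environment="JMX_PORT=9999"
```

Run systemctl daemon-reload afterwards; JMX_PORT is picked up by kafka-run-class.sh.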
Setting up TLS between brokers and clients
Without TLS, all traffic travels in plaintext. At a minimum, enable TLS for external clients.
# Generate a CA (once; shared by all brokers)
openssl req -new -x509 -keyout ca.key -out ca.crt -days 365 \
  -subj "/CN=Kafka-CA" -nodes
# Generate a keystore with a key pair for each broker
keytool -keystore kafka-1.keystore.jks -alias kafka-1 \
  -keyalg RSA -validity 365 \
  -genkey -storepass changeit \
  -dname "CN=kafka-1.internal, OU=Kafka, O=Company, L=City, ST=State, C=US"
# Create a CSR and sign it with the CA
keytool -keystore kafka-1.keystore.jks -alias kafka-1 \
  -certreq -file kafka-1.csr -storepass changeit
openssl x509 -req -CA ca.crt -CAkey ca.key \
  -in kafka-1.csr -out kafka-1-signed.crt \
  -days 365 -CAcreateserial
# Import the CA cert and the signed cert back into the keystore
keytool -keystore kafka-1.keystore.jks -alias CARoot \
  -importcert -file ca.crt -storepass changeit -noprompt
keytool -keystore kafka-1.keystore.jks -alias kafka-1 \
  -importcert -file kafka-1-signed.crt -storepass changeit -noprompt
# Build the truststore (same file on every broker)
keytool -keystore kafka.truststore.jks -alias CARoot \
  -importcert -file ca.crt -storepass changeit -noprompt
Add to server.properties (the new SSL listener must also appear in the protocol map):
listeners=PLAINTEXT://0.0.0.0:9092,SSL://0.0.0.0:9094,CONTROLLER://0.0.0.0:9093
listener.security.protocol.map=PLAINTEXT:PLAINTEXT,SSL:SSL,CONTROLLER:PLAINTEXT
ssl.keystore.location=/etc/kafka/ssl/kafka-1.keystore.jks
ssl.keystore.password=changeit
ssl.key.password=changeit
ssl.truststore.location=/etc/kafka/ssl/kafka.truststore.jks
ssl.truststore.password=changeit
ssl.client.auth=required
ssl.enabled.protocols=TLSv1.3,TLSv1.2
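With ssl.client.auth=required, clients must present their own certificate as well. A client.properties sketch (paths and passwords are assumptions):

```properties
security.protocol=SSL
ssl.truststore.location=/etc/kafka/ssl/kafka.truststore.jks
ssl.truststore.password=changeit
# keystore is needed because the broker sets ssl.client.auth=required
ssl.keystore.location=/etc/kafka/ssl/client.keystore.jks
ssl.keystore.password=changeit
ssl.key.password=changeit
```

Point the CLI tools at it with --command-config client.properties and use bootstrap servers on the SSL port 9094.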
Monitoring — JMX + Prometheus
# kafka-jmx-exporter.yml — configuration for the standalone JMX Exporter
# (assumes the broker exposes JMX on port 9999, e.g. JMX_PORT=9999 in its environment)
startDelaySeconds: 0
hostPort: 127.0.0.1:9999
lowercaseOutputName: true
rules:
- pattern: kafka.server<type=BrokerTopicMetrics, name=MessagesInPerSec><>OneMinuteRate
name: kafka_server_broker_topic_messages_in_per_sec
- pattern: kafka.server<type=ReplicaManager, name=UnderReplicatedPartitions><>Value
name: kafka_server_under_replicated_partitions
- pattern: kafka.controller<type=KafkaController, name=ActiveControllerCount><>Value
name: kafka_controller_active_count
- pattern: kafka.network<type=RequestMetrics, name=TotalTimeMs, request=Produce><>99thPercentile
name: kafka_network_produce_total_time_ms_p99
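On the Prometheus side, scrape the exporter on each broker; a prometheus.yml fragment, assuming the exporter serves HTTP on port 7071:

```yaml
scrape_configs:
  - job_name: kafka
    static_configs:
      - targets: ['kafka-1:7071', 'kafka-2:7071', 'kafka-3:7071']
```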
Key metrics for alerting:
- kafka_server_under_replicated_partitions > 0 — replica loss
- kafka_controller_active_count != 1 — controller issue
- consumer lag > threshold — consumer lagging
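The first two thresholds translate directly into Prometheus alerting rules; a sketch (the consumer-lag alert assumes a lag exporter such as kafka-exporter publishing a kafka_consumergroup_lag metric):

```yaml
groups:
  - name: kafka
    rules:
      - alert: KafkaUnderReplicatedPartitions
        expr: kafka_server_under_replicated_partitions > 0
        for: 5m
      - alert: KafkaNoActiveController
        expr: sum(kafka_controller_active_count) != 1
        for: 1m
      - alert: KafkaConsumerLagHigh
        expr: kafka_consumergroup_lag > 10000   # threshold is workload-specific
        for: 10m
```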
Initial testing
# Create test topic
kafka-topics.sh --bootstrap-server kafka-1:9092 \
--create --topic test-topic \
--partitions 6 --replication-factor 3
# Check replication
kafka-topics.sh --bootstrap-server kafka-1:9092 \
--describe --topic test-topic
# Producer performance test
kafka-producer-perf-test.sh \
--topic test-topic \
--num-records 1000000 \
--record-size 1024 \
--throughput -1 \
--producer-props bootstrap.servers=kafka-1:9092,kafka-2:9092,kafka-3:9092 \
acks=all compression.type=lz4
# Consumer test
kafka-consumer-perf-test.sh \
--bootstrap-server kafka-1:9092 \
--topic test-topic \
--messages 1000000 \
--group perf-test-group
Project timeline
Day 1 — infrastructure preparation: 3 VMs/servers with separate disks for Kafka data (not system partition), DNS setup, open ports 9092/9093/9094 between nodes.
Day 2 — Java installation, Kafka setup, cluster UUID generation, storage formatting, systemd configuration, cluster startup.
Day 3 — TLS setup, creation of production topics with correct partition/replication factors, performance testing.
Day 4 — monitoring integration (JMX Exporter + Prometheus + Grafana), alert configuration for under-replicated partitions and consumer lag.
Day 5 — failure scenario testing: broker shutdown, verify cluster continues, recovery.
Additionally: setting up Kafka Schema Registry or Kafka Connect adds 2–3 days each.