Apache Kafka has become the backbone of modern event-driven architectures. Whether you're building real-time data pipelines, implementing microservices communication, or processing millions of events per second, Kafka knowledge is essential for backend and data engineering roles.
This guide covers the Kafka concepts that interviewers actually ask about—from core fundamentals to production operational knowledge.
Table of Contents
- Kafka Architecture Questions
- Topic and Partition Questions
- Producer Questions
- Consumer Questions
- Exactly-Once Semantics Questions
- Kafka Streams Questions
- ksqlDB Questions
- Spring Kafka Questions
- Replication and Reliability Questions
- Performance Tuning Questions
- Monitoring and Operations Questions
Kafka Architecture Questions
Understanding Kafka's architecture is the foundation for all advanced topics.
What are the main components of Kafka architecture?
Kafka's architecture consists of several interconnected components that work together to provide distributed, fault-tolerant event streaming. At its core, a Kafka cluster contains multiple brokers that store data and serve client requests. Topics organize messages into logical categories, while partitions enable horizontal scaling and parallel processing.
Each partition has a leader replica that handles all reads and writes, with follower replicas providing fault tolerance. Consumer groups coordinate multiple consumers to process messages in parallel, with each partition assigned to exactly one consumer within the group (note that this gives at-least-once delivery by default, not exactly-once).
flowchart TB
subgraph cluster["Kafka Cluster"]
direction LR
subgraph B1["Broker 1"]
P0L["Topic A Part 0<br/>(Leader)"]
P1R["Topic A Part 1<br/>(Replica)"]
end
subgraph B2["Broker 2"]
P1L["Topic A Part 1<br/>(Leader)"]
P0R1["Topic A Part 0<br/>(Replica)"]
end
subgraph B3["Broker 3"]
P2L["Topic A Part 2<br/>(Leader)"]
P0R2["Topic A Part 0<br/>(Replica)"]
end
end
Producer["Producer"] --> cluster
cluster --> CG["Consumer<br/>Group"]
Key components:
- Broker: A Kafka server that stores data and serves clients
- Topic: A category/feed name to which records are published
- Partition: An ordered, immutable sequence of records within a topic
- Replica: A copy of a partition for fault tolerance
- Leader: The replica that handles all reads and writes for a partition
- Consumer Group: A set of consumers that cooperate to consume a topic
What is Zookeeper's role in Kafka and what is KRaft?
Zookeeper traditionally served as Kafka's coordination service, managing cluster membership, topic configuration, partition metadata, controller election, and access control lists. However, this dependency added operational complexity and limited Kafka's scalability to around 200,000 partitions per cluster.
KRaft (Kafka Raft) is the new metadata management system that eliminates the Zookeeper dependency entirely. It uses a Raft-based consensus protocol where some brokers act as controllers, storing metadata in an internal Kafka topic. This simplification became production-ready in Kafka 3.3.
# KRaft mode configuration (Kafka 3.3+)
process.roles=broker,controller
node.id=1
controller.quorum.voters=1@localhost:9093,2@localhost:9094,3@localhost:9095
Benefits of KRaft:
- Simplified operations (one system instead of two)
- Better scalability (millions of partitions)
- Faster controller failover
- Reduced operational complexity
Interview tip: Know that KRaft has been production-ready since Kafka 3.3; Zookeeper mode was deprecated in 3.5 and removed entirely in Kafka 4.0.
Topic and Partition Questions
Topics and partitions are fundamental to understanding how Kafka scales and maintains ordering guarantees.
What is a Kafka topic and how does it differ from a traditional message queue?
A Kafka topic is a logical channel for publishing and subscribing to streams of records, functioning as the primary abstraction for organizing data. Unlike traditional message queues where messages are deleted after consumption, Kafka retains messages based on configurable retention policies, allowing multiple consumers to read the same data and enabling replay from any point in time.
This fundamental difference in data retention model makes Kafka suitable for use cases like event sourcing, audit logging, and stream processing where historical data access is valuable.
| Aspect | Traditional Queue | Kafka Topic |
|---|---|---|
| Message retention | Deleted after consumption | Retained based on policy |
| Multiple consumers | Message goes to one consumer | All consumer groups get all messages |
| Replay capability | Not possible | Can replay from any offset |
| Ordering | Queue-wide (typically) | Per-partition only |
| Scaling | Limited | Horizontal via partitions |
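The retention difference in the table above can be sketched with a toy in-memory log. This is a hypothetical illustration of the model, not the Kafka API: reads never remove records, so any number of readers can start from any offset, and the same data can be replayed later.

```java
import java.util.ArrayList;
import java.util.List;

// Toy append-only log illustrating Kafka's retention model:
// reading does not delete records, so any offset can be replayed.
public class RetainedLog {
    private final List<String> records = new ArrayList<>();

    public long append(String record) {
        records.add(record);
        return records.size() - 1; // offset of the new record
    }

    // Read everything from a given offset; the log itself is unchanged.
    public List<String> readFrom(long offset) {
        return new ArrayList<>(records.subList((int) offset, records.size()));
    }
}
```

A traditional queue would remove each record on delivery; here two independent "consumer groups" can both call `readFrom(0)` and see the full history.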
How do partitions affect message ordering?
Kafka guarantees message ordering only within a single partition, not across an entire topic. When a producer sends messages with the same key, Kafka hashes the key to determine the partition, ensuring all messages with that key go to the same partition and are therefore ordered. Messages with different keys may go to different partitions and can be processed in any order relative to each other.
This design enables parallel processing while maintaining ordering where it matters—typically by using entity identifiers as keys so that all events for a single entity are processed in order.
// Messages with same key go to same partition (ordered)
producer.send(new ProducerRecord<>("orders", "customer-123", order1));
producer.send(new ProducerRecord<>("orders", "customer-123", order2));
// order1 always processed before order2 for customer-123
// Messages with different keys may go to different partitions
producer.send(new ProducerRecord<>("orders", "customer-456", order3));
// order3 may be processed before order1 or order2
Partition assignment formula (default):
partition = hash(key) % numberOfPartitions
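The formula can be sketched in plain Java. This only illustrates the same-key-same-partition property: Kafka's real default partitioner hashes the serialized key bytes with murmur2, not `String.hashCode`.

```java
// Sketch of "hash the key, mod partition count" routing.
// Kafka's DefaultPartitioner actually uses murmur2 on the key bytes;
// String.hashCode stands in here just to demonstrate the property.
public class PartitionSketch {
    public static int partitionFor(String key, int numPartitions) {
        // floorMod avoids the negative result a plain % can produce
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int p1 = partitionFor("customer-123", 6);
        int p2 = partitionFor("customer-123", 6);
        // Same key always lands on the same partition, so its messages stay ordered
        System.out.println(p1 == p2); // true
    }
}
```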
How many partitions should a topic have?
Choosing the right partition count requires balancing several factors. More partitions enable greater parallelism since each partition can be consumed by a separate consumer, but they also increase broker memory usage, extend recovery time after failures, and can add end-to-end latency due to more leader elections and replication overhead.
A common starting point is to calculate the maximum of your expected throughput divided by 10 MB/s (a typical partition throughput) and your desired number of parallel consumers. Monitor actual performance and adjust based on observed throughput, latency, and resource utilization.
Considerations:
- Parallelism: More partitions = more parallel consumers
- Throughput: Each partition can handle ~10 MB/s writes
- Latency: More partitions = more end-to-end latency
- Memory: Each partition uses broker memory
- Recovery time: More partitions = longer recovery after broker failure
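The starting-point rule above is simple arithmetic. A sketch, treating the ~10 MB/s per-partition figure as an assumption to validate against your own benchmarks:

```java
// Rule of thumb from the text: partitions = max(target throughput /
// per-partition throughput, desired parallel consumers).
// The 10 MB/s per-partition figure is an illustrative assumption.
public class PartitionCount {
    public static int suggestPartitions(double targetMBps, double perPartitionMBps,
                                        int desiredConsumers) {
        int forThroughput = (int) Math.ceil(targetMBps / perPartitionMBps);
        return Math.max(forThroughput, desiredConsumers);
    }

    public static void main(String[] args) {
        // 100 MB/s target at ~10 MB/s per partition, with 12 planned consumers
        System.out.println(suggestPartitions(100, 10, 12)); // 12
    }
}
```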
Producer Questions
Understanding producer configuration is critical for reliable message delivery.
What are the different acks settings and their trade-offs?
The acks (acknowledgments) setting controls how many partition replicas must confirm receipt before the producer considers a write successful. This is the primary knob for trading off between durability and latency. With acks=0, the producer doesn't wait for any confirmation—fastest but may lose messages. With acks=1, only the leader must acknowledge, which can still lose data if the leader fails before replication. With acks=all, all in-sync replicas must acknowledge, providing the strongest durability guarantee.
For most production use cases, acks=all combined with min.insync.replicas=2 provides the right balance of durability without sacrificing too much performance.
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
// acks=0: Fire and forget (fastest, least reliable)
props.put("acks", "0");
// acks=1: Leader acknowledgment (balanced)
props.put("acks", "1");
// acks=all: All in-sync replicas acknowledge (slowest, most reliable)
props.put("acks", "all");
| Setting | Behavior | Durability | Latency |
|---|---|---|---|
| acks=0 | Don't wait for acknowledgment | May lose messages | Lowest |
| acks=1 | Wait for leader to write | May lose if leader fails before replication | Medium |
| acks=all | Wait for all ISR to write | No loss if ISR > 1 | Highest |
What is an idempotent producer and why is it important?
An idempotent producer ensures exactly-once delivery to a single partition even when retries occur due to network issues or broker failures. Without idempotence, a retry after a timeout could result in duplicate messages if the original write actually succeeded but the acknowledgment was lost. The idempotent producer solves this by assigning each producer instance a unique Producer ID and each message a sequence number, allowing brokers to detect and deduplicate retried messages.
Enabling idempotence is straightforward and has minimal performance overhead, making it a best practice for all production deployments where message duplication would be problematic.
props.put("enable.idempotence", "true");
// Automatically sets:
// - acks=all
// - retries=Integer.MAX_VALUE
// - max.in.flight.requests.per.connection=5
How it works:
- Producer gets a unique Producer ID (PID) on initialization
- Each message gets a sequence number
- Broker deduplicates based on PID + sequence number
- Retried messages with same sequence are ignored
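The dedup step can be sketched as a map from producer ID to the last accepted sequence number. This is a simplified model; real brokers track sequences per partition and per batch.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of broker-side deduplication for idempotent producers:
// remember the last sequence number per producer ID and drop repeats.
public class SequenceDedup {
    private final Map<Long, Integer> lastSeqByPid = new HashMap<>();

    // Returns true if the write is accepted, false if it is a duplicate retry.
    public boolean tryAppend(long producerId, int sequence) {
        Integer last = lastSeqByPid.get(producerId);
        if (last != null && sequence <= last) {
            return false; // already written; the retry is ignored
        }
        lastSeqByPid.put(producerId, sequence);
        return true;
    }
}
```

A retry carries the same sequence number as the original send, so even if the first acknowledgment was lost in transit, the second append is recognized and dropped.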
How does Kafka batching work and how do you tune it?
Kafka producers batch multiple messages together before sending them to brokers, improving throughput by reducing network round trips and enabling more efficient compression. The batch.size parameter sets the maximum batch size in bytes, while linger.ms controls how long the producer waits for additional messages before sending a partially-filled batch.
With linger.ms=0 (the default), messages are sent immediately without waiting for batching. Increasing this value allows more messages to accumulate, improving throughput at the cost of higher latency. Compression further improves efficiency by reducing network bandwidth and storage requirements at the cost of CPU overhead.
// Batch size in bytes
props.put("batch.size", 16384); // 16 KB default
// Maximum wait time to fill batch
props.put("linger.ms", 5); // Wait up to 5ms for more messages
// Compression
props.put("compression.type", "lz4"); // Options: none, gzip, snappy, lz4, zstd
Batching trade-offs:
- Larger batches: Better throughput, higher latency
- Smaller batches: Lower latency, more overhead
- Compression: Reduces network/storage but adds CPU overhead
How do you control which partition a message goes to?
Kafka provides three mechanisms for controlling partition assignment. The default behavior uses key-based partitioning where the hash of the message key determines the partition—this ensures all messages with the same key go to the same partition. For special cases, you can explicitly specify a partition number or implement a custom partitioner that applies business logic to route messages.
Custom partitioners are useful for scenarios like routing high-priority messages to dedicated partitions or implementing geographic routing based on message content.
// 1. Key-based (default): hash(key) % partitions
producer.send(new ProducerRecord<>("topic", "key", "value"));
// 2. Explicit partition
producer.send(new ProducerRecord<>("topic", 2, "key", "value"));
// 3. Custom partitioner
public class CustomPartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
        // Route VIP customers to the dedicated partition 0
        if (key != null && key.toString().startsWith("vip-")) {
            return 0;
        }
        // Default: hash-based distribution across the remaining partitions.
        // floorMod avoids the negative result that Math.abs(hashCode())
        // can produce for Integer.MIN_VALUE.
        int hash = (key == null) ? 0 : key.hashCode();
        return Math.floorMod(hash, partitions.size() - 1) + 1;
    }
}
Consumer Questions
Consumer configuration affects reliability, throughput, and exactly-once semantics.
How do consumer groups enable scalable consumption?
Consumer groups provide Kafka's mechanism for parallel message processing with automatic load balancing. When multiple consumers join the same group, Kafka distributes partitions among them so each partition is assigned to exactly one consumer. This enables horizontal scaling—add more consumers to process messages faster, up to the number of partitions.
Different consumer groups operate independently, each maintaining their own offset position and receiving all messages from subscribed topics. This enables the publish-subscribe pattern where multiple applications can independently consume the same data stream.
flowchart TB
subgraph topic["Topic: orders (3 partitions)"]
direction LR
P0["Part 0"]
P1["Part 1"]
P2["Part 2"]
end
subgraph groupA["Consumer Group A"]
direction LR
C1A["Consumer 1<br/>Part 0"]
C2A["Consumer 2<br/>Part 1, 2"]
end
subgraph groupB["Consumer Group B"]
C1B["Consumer 1<br/>Part 0, 1, 2"]
end
P0 --> C1A
P1 --> C2A
P2 --> C2A
P0 --> C1B
P1 --> C1B
P2 --> C1BKey rules:
- Each partition assigned to exactly one consumer per group
- Multiple groups can consume same topic independently
- Max effective consumers = number of partitions
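These rules can be illustrated with a toy round-robin assignor. This is a sketch only; Kafka's built-in assignors (range, sticky, cooperative-sticky) use more elaborate strategies.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy round-robin partition assignment illustrating the consumer-group rules:
// each partition goes to exactly one consumer; extra consumers sit idle.
public class GroupAssignment {
    public static Map<String, List<Integer>> assign(List<String> consumers, int partitions) {
        Map<String, List<Integer>> out = new LinkedHashMap<>();
        for (String c : consumers) out.put(c, new ArrayList<>());
        for (int p = 0; p < partitions; p++) {
            // Partition p is owned by exactly one consumer in the group
            out.get(consumers.get(p % consumers.size())).add(p);
        }
        return out;
    }
}
```

With 2 consumers and 3 partitions, one consumer receives two partitions; with 4 consumers and 3 partitions, one consumer receives nothing, which is why the effective maximum is the partition count.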
How do consumers track their position with offsets?
Consumers track their progress through each partition using offsets—sequential numbers that identify each message's position. Kafka stores committed offsets in an internal topic (__consumer_offsets), allowing consumers to resume from where they left off after restarts. With auto-commit enabled, offsets are committed periodically in the background, providing at-least-once semantics since a crash between processing and commit can result in reprocessing.
Manual commit provides finer control, allowing you to commit offsets only after successfully processing messages, which is essential for exactly-once processing patterns.
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "my-consumer-group");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
// Auto commit (default, at-least-once)
props.put("enable.auto.commit", "true");
props.put("auto.commit.interval.ms", "5000");
// Manual commit for more control
props.put("enable.auto.commit", "false");
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList("orders"));
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        processRecord(record);
    }
    // Synchronous commit (blocks until confirmed)
    consumer.commitSync();
    // Or async commit (non-blocking)
    consumer.commitAsync((offsets, exception) -> {
        if (exception != null) {
            log.error("Commit failed", exception);
        }
    });
}
What happens when a new consumer joins a group?
When a consumer joins or leaves a group, Kafka triggers a rebalance to redistribute partitions among active consumers. During rebalance, all consumers in the group temporarily stop processing while the group coordinator reassigns partitions. After reassignment, consumers resume from their last committed offsets, which may result in some message reprocessing if offsets weren't committed before the rebalance.
Cooperative (incremental) rebalancing, introduced in Kafka 2.4, minimizes disruption by only reassigning affected partitions rather than stopping all consumers.
// Enable cooperative rebalancing
props.put("partition.assignment.strategy",
    "org.apache.kafka.clients.consumer.CooperativeStickyAssignor");
Rebalance strategies:
- Eager: Stop all consumers, reassign all partitions (the classic protocol; still the default)
- Cooperative (incremental): Only reassign affected partitions (opt-in via CooperativeStickyAssignor, available since 2.4)
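The practical difference is which partitions must be revoked. A sketch of the cooperative case, where only partitions that change owner move (eager would revoke everything regardless):

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of why cooperative rebalancing is less disruptive: diffing the old
// and new assignments shows that only partitions whose owner changes need
// to be revoked, while eager rebalancing revokes every partition.
public class RebalanceDiff {
    public static Set<Integer> cooperativeRevocations(Map<String, Set<Integer>> oldAssign,
                                                      Map<String, Set<Integer>> newAssign) {
        Set<Integer> revoked = new HashSet<>();
        for (Map.Entry<String, Set<Integer>> e : oldAssign.entrySet()) {
            for (int p : e.getValue()) {
                // Revoke only if this consumer no longer owns the partition
                if (!newAssign.getOrDefault(e.getKey(), Set.of()).contains(p)) {
                    revoked.add(p);
                }
            }
        }
        return revoked;
    }
}
```

If consumer `c2` joins a group where `c1` held partitions 0-2 and the new plan is `c1: {0,1}, c2: {2}`, only partition 2 pauses; `c1` keeps processing 0 and 1 throughout.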
Exactly-Once Semantics Questions
Exactly-once processing is critical for applications where duplicate or lost messages cause business problems.
How do you achieve exactly-once processing with Kafka?
Exactly-once semantics in Kafka operates at three levels. The idempotent producer prevents duplicates within a single partition during retries. Transactional producers enable atomic writes across multiple partitions—either all messages in a transaction are visible to consumers, or none are. End-to-end exactly-once combines both with read_committed isolation level to create read-process-write patterns where consuming, processing, and producing are atomic.
Each level builds on the previous one, with the full exactly-once pattern requiring coordination between consumer offsets and producer transactions.
Level 1: Idempotent producer (producer → Kafka)
props.put("enable.idempotence", "true");
Level 2: Transactional producer (across partitions)
props.put("transactional.id", "my-transactional-id");
KafkaProducer<String, String> producer = new KafkaProducer<>(props);
producer.initTransactions();
try {
    producer.beginTransaction();
    producer.send(new ProducerRecord<>("topic1", "key", "value1"));
    producer.send(new ProducerRecord<>("topic2", "key", "value2"));
    producer.commitTransaction();
} catch (Exception e) {
    producer.abortTransaction();
}
Level 3: Read-process-write (end-to-end)
// Consumer reads with isolation
props.put("isolation.level", "read_committed");
// Process and produce atomically
producer.beginTransaction();
for (ConsumerRecord<String, String> record : records) {
    ProducerRecord<String, String> output = process(record);
    producer.send(output);
}
// Commit consumer offsets as part of transaction
producer.sendOffsetsToTransaction(offsets, consumerGroupId);
producer.commitTransaction();
Kafka Streams Questions
Stream processing allows real-time transformations on Kafka data.
What is Kafka Streams and how does it differ from other stream processors?
Kafka Streams is a lightweight Java library for building stream processing applications, not a separate cluster infrastructure like Apache Flink or Spark Streaming. Applications built with Kafka Streams run as standard JVM applications—you simply start multiple instances to scale horizontally, and Kafka handles partition assignment and rebalancing automatically. This eliminates the operational overhead of managing a separate stream processing cluster.
The library provides exactly-once processing semantics, automatic state management with fault-tolerant local stores, and seamless integration with Kafka's partitioning model for parallel processing.
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-processor");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
StreamsBuilder builder = new StreamsBuilder();
// Read from topic as stream
KStream<String, Order> orders = builder.stream("orders");
// Transform: filter, map, flatMap
KStream<String, Order> highValueOrders = orders
    .filter((key, order) -> order.getAmount() > 1000)
    .mapValues(order -> enrichOrder(order));
// Write to output topic
highValueOrders.to("high-value-orders");
KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();
Advantages:
- No separate cluster needed
- Exactly-once processing built-in
- Elastic scaling (just start more instances)
- Fault-tolerant with automatic recovery
What is the difference between KStream and KTable?
KStream and KTable represent two different ways of interpreting data in Kafka Streams. A KStream treats each record as an independent event in an append-only log—every record matters, even multiple records with the same key. A KTable interprets records as updates to a changelog, maintaining only the latest value for each key as current state.
This distinction is crucial for joins: joining a stream with a table enriches events with the latest lookup value, while joining two streams correlates events that occur within a time window.
// KStream: Append-only stream of events
// Each record is independent
KStream<String, PageView> pageViews = builder.stream("page-views");
// Records: (user1, view1), (user1, view2), (user2, view1)
// KTable: Changelog stream, latest value per key
// Represents current state
KTable<String, UserProfile> users = builder.table("users");
// State: {user1: profile1, user2: profile2}
// Join stream with table (enrichment)
KStream<String, EnrichedPageView> enriched = pageViews.join(
    users,
    (pageView, profile) -> new EnrichedPageView(pageView, profile)
);
| Aspect | KStream | KTable |
|---|---|---|
| Semantics | Event stream | Changelog / state |
| Records | All events | Latest per key |
| State | Stateless (joins/windows add state) | Backed by a local state store |
| Operations | map, filter, flatMap, join | aggregate, reduce, join |
How do you perform windowed aggregations in Kafka Streams?
Windowed aggregations group events by time intervals, essential for computing metrics like "orders per hour" or "page views in the last 5 minutes." Kafka Streams supports three window types: tumbling windows are fixed-size, non-overlapping intervals; hopping windows are fixed-size but overlap; and session windows group events by activity periods separated by gaps of inactivity.
Each window type serves different use cases—tumbling for hourly reports, hopping for moving averages, and session for user activity analysis.
KStream<String, Transaction> transactions = builder.stream("transactions");
// Tumbling window: fixed, non-overlapping
KTable<Windowed<String>, Long> hourlyCounts = transactions
    .groupByKey()
    .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofHours(1)))
    .count();
// Hopping window: fixed size, overlapping
KTable<Windowed<String>, Long> hoppingCounts = transactions
    .groupByKey()
    .windowedBy(TimeWindows.ofSizeAndGrace(
        Duration.ofMinutes(5),
        Duration.ofMinutes(1)
    ).advanceBy(Duration.ofMinutes(1)))
    .count();
// Session window: activity-based, variable size
KTable<Windowed<String>, Long> sessionCounts = transactions
    .groupByKey()
    .windowedBy(SessionWindows.ofInactivityGapWithNoGrace(Duration.ofMinutes(30)))
    .count();
ksqlDB Questions
ksqlDB provides SQL-based stream processing for teams that prefer declarative syntax.
When would you use ksqlDB instead of Kafka Streams?
ksqlDB is a streaming SQL engine that runs as a separate server, allowing you to write stream processing logic using familiar SQL syntax rather than Java code. This makes it ideal for rapid prototyping, simpler transformations, and teams with strong SQL expertise but less Java experience. ksqlDB also provides interactive querying capabilities for exploring streaming data.
However, for complex business logic, custom serialization, or when embedding stream processing directly in an application, Kafka Streams offers more flexibility and control.
-- Create a stream from a topic
CREATE STREAM orders (
order_id VARCHAR KEY,
customer_id VARCHAR,
amount DECIMAL(10,2),
order_time TIMESTAMP
) WITH (
KAFKA_TOPIC='orders',
VALUE_FORMAT='JSON'
);
-- Real-time aggregation
CREATE TABLE hourly_revenue AS
SELECT
customer_id,
WINDOWSTART as window_start,
SUM(amount) as total_revenue,
COUNT(*) as order_count
FROM orders
WINDOW TUMBLING (SIZE 1 HOUR)
GROUP BY customer_id
EMIT CHANGES;
-- Join a stream with a table (assumes customers was created as a TABLE;
-- a stream-stream join would also require a WITHIN clause)
CREATE STREAM enriched_orders AS
SELECT
o.order_id,
o.amount,
c.name as customer_name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
EMIT CHANGES;
Use ksqlDB when:
- SQL is preferred over Java code
- Rapid prototyping needed
- Simpler transformations and aggregations
- Team has SQL expertise
Use Kafka Streams when:
- Complex business logic required
- Need full programmatic control
- Custom serialization/processing
- Embedding in existing application
Spring Kafka Questions
Spring Kafka simplifies Kafka integration in Spring Boot applications.
How do you configure Kafka in Spring Boot?
Spring Boot provides auto-configuration for Kafka through application properties, eliminating most boilerplate code. You specify bootstrap servers, serializers, and consumer group settings in YAML or properties files, and Spring creates the necessary producer and consumer factories automatically. For JSON serialization, Spring Kafka provides built-in serializers that handle object-to-JSON conversion with configurable trusted packages for deserialization security.
# application.yml
spring:
  kafka:
    bootstrap-servers: localhost:9092
    producer:
      key-serializer: org.apache.kafka.common.serialization.StringSerializer
      value-serializer: org.springframework.kafka.support.serializer.JsonSerializer
      acks: all
      retries: 3
    consumer:
      group-id: my-group
      key-deserializer: org.apache.kafka.common.serialization.StringDeserializer
      value-deserializer: org.springframework.kafka.support.serializer.JsonDeserializer
      auto-offset-reset: earliest
      properties:
        spring.json.trusted.packages: com.example.dto
How do you produce messages with KafkaTemplate?
KafkaTemplate is Spring Kafka's primary abstraction for sending messages, providing both synchronous and asynchronous sending options. The fire-and-forget approach is simplest but provides no feedback on success or failure. Adding callbacks enables logging and error handling for asynchronous sends, while the synchronous approach blocks until the broker acknowledges receipt—useful when you need guaranteed delivery before proceeding.
@Service
@RequiredArgsConstructor
@Slf4j
public class OrderProducer {
    private final KafkaTemplate<String, Order> kafkaTemplate;

    // Fire and forget
    public void sendOrder(Order order) {
        kafkaTemplate.send("orders", order.getId(), order);
    }

    // With callback
    public void sendOrderWithCallback(Order order) {
        CompletableFuture<SendResult<String, Order>> future =
            kafkaTemplate.send("orders", order.getId(), order);
        future.whenComplete((result, ex) -> {
            if (ex == null) {
                log.info("Sent order {} to partition {} offset {}",
                    order.getId(),
                    result.getRecordMetadata().partition(),
                    result.getRecordMetadata().offset());
            } else {
                log.error("Failed to send order {}", order.getId(), ex);
            }
        });
    }

    // Synchronous (blocking)
    public void sendOrderSync(Order order) throws Exception {
        kafkaTemplate.send("orders", order.getId(), order).get(10, TimeUnit.SECONDS);
    }
}
How do you consume messages with @KafkaListener?
The @KafkaListener annotation declaratively configures message consumption, automatically deserializing messages and invoking the annotated method for each record. You can access message metadata through parameter annotations, process batches for higher throughput, or use manual acknowledgment for precise control over offset commits.
Manual acknowledgment is essential for exactly-once processing patterns where you want to commit offsets only after successful processing.
@Service
@Slf4j
public class OrderConsumer {
    // Basic listener
    @KafkaListener(topics = "orders", groupId = "order-processor")
    public void processOrder(Order order) {
        log.info("Received order: {}", order.getId());
        // Process order
    }

    // With metadata
    @KafkaListener(topics = "orders", groupId = "order-processor")
    public void processOrderWithMetadata(
            @Payload Order order,
            @Header(KafkaHeaders.RECEIVED_PARTITION) int partition,
            @Header(KafkaHeaders.OFFSET) long offset,
            @Header(KafkaHeaders.RECEIVED_TIMESTAMP) long timestamp) {
        log.info("Received order {} from partition {} at offset {}",
            order.getId(), partition, offset);
    }

    // Batch processing
    @KafkaListener(topics = "orders", groupId = "batch-processor",
        containerFactory = "batchFactory")
    public void processOrderBatch(List<Order> orders) {
        log.info("Received batch of {} orders", orders.size());
        orders.forEach(this::processOrder);
    }

    // Manual acknowledgment
    @KafkaListener(topics = "orders", groupId = "manual-ack")
    public void processWithManualAck(Order order, Acknowledgment ack) {
        try {
            processOrder(order);
            ack.acknowledge(); // Commit offset only after successful processing
        } catch (Exception e) {
            // Don't ack - message will be redelivered
            throw e;
        }
    }
}
How do you handle errors and retries in Spring Kafka?
Spring Kafka provides flexible error handling through DefaultErrorHandler with configurable retry policies and dead letter topic publishing. Failed messages can be automatically retried with backoff delays before being sent to a dead letter topic for manual investigation. The @RetryableTopic annotation simplifies this further by automatically creating retry topics with exponential backoff.
@Configuration
public class KafkaConfig {
    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, Order> kafkaListenerContainerFactory(
            ConsumerFactory<String, Order> consumerFactory,
            KafkaTemplate<Object, Object> kafkaTemplate) {
        ConcurrentKafkaListenerContainerFactory<String, Order> factory =
            new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory);
        // Retry configuration
        factory.setCommonErrorHandler(new DefaultErrorHandler(
            new DeadLetterPublishingRecoverer(kafkaTemplate),
            new FixedBackOff(1000L, 3) // 3 retries, 1 second apart
        ));
        return factory;
    }
}
// Or use retry topics (Spring Kafka 2.7+)
@RetryableTopic(
    attempts = "3",
    backoff = @Backoff(delay = 1000, multiplier = 2),
    dltTopicSuffix = ".DLT",
    autoCreateTopics = "true"
)
@KafkaListener(topics = "orders")
public void processWithRetryTopic(Order order) {
    // Failures flow through auto-created retry topics with increasing
    // backoff, then land in orders.DLT once attempts are exhausted
    processOrder(order);
}

@DltHandler
public void handleDlt(Order order) {
    log.error("Message exhausted retries: {}", order);
    // Alert, store for manual review, etc.
}
Replication and Reliability Questions
Production Kafka deployments require careful configuration for reliability and fault tolerance.
How does Kafka replication work?
Kafka replication provides fault tolerance by maintaining multiple copies of each partition across different brokers. Each partition has one leader replica that handles all client reads and writes, while follower replicas passively replicate data from the leader. The set of replicas that are caught up with the leader form the In-Sync Replica set (ISR), which determines which replicas are eligible to become leader if the current leader fails.
Followers that fall behind (due to network issues or slow disks) are removed from the ISR until they catch up, ensuring that leader election only promotes fully synchronized replicas.
flowchart LR
subgraph B1["Broker 1 (Leader)"]
L1["Message 1"]
L2["Message 2"]
L3["Message 3<br/>(committed)"]
end
subgraph B2["Broker 2 (Follower)"]
F1["Message 1"]
F2["Message 2"]
F3["Message 3<br/>(in ISR)"]
end
subgraph B3["Broker 3 (Follower)"]
R1["Message 1"]
R2["Message 2"]
R3["(catching up)<br/>(not in ISR)"]
end
B2 -->|"replicates"| B1
B3 -.->|"catching up"| B1
In-Sync Replicas (ISR): Replicas that are:
- Alive and connected to the controller (or Zookeeper in legacy mode)
- Fetching from the leader within replica.lag.time.max.ms
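The ISR rule is a time-based check; a minimal sketch, using the default 30-second replica.lag.time.max.ms as an example value:

```java
// Sketch of the ISR membership rule: a follower stays in the ISR while
// its last fully-caught-up time is within replica.lag.time.max.ms of now.
public class IsrCheck {
    public static boolean inSync(long nowMs, long lastCaughtUpMs, long lagMaxMs) {
        return nowMs - lastCaughtUpMs <= lagMaxMs;
    }

    public static void main(String[] args) {
        long lagMax = 30_000; // default replica.lag.time.max.ms
        // Follower caught up 5 seconds ago: still in sync
        System.out.println(inSync(40_000, 35_000, lagMax)); // true
    }
}
```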
What is min.insync.replicas and why is it important?
The min.insync.replicas setting defines the minimum number of replicas that must acknowledge a write when using acks=all. This prevents data loss in scenarios where most replicas are unavailable—if the ISR falls below this threshold, the broker rejects writes with a NOT_ENOUGH_REPLICAS error rather than accepting writes that can't be adequately replicated.
The standard production configuration uses replication factor 3 with min.insync.replicas=2, allowing the cluster to tolerate one broker failure while still accepting writes, and tolerating a second failure for reads.
# Topic configuration
min.insync.replicas=2
With acks=all and min.insync.replicas=2:
- Write succeeds only if at least 2 replicas acknowledge
- If ISR drops below 2, writes are rejected (NOT_ENOUGH_REPLICAS)
- Prevents data loss even if one broker fails
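This gate can be sketched as a simple precondition on the produce path (a simplified model of the leader's check, not broker code):

```java
// Simplified model of the acks=all write gate: when the ISR is smaller than
// min.insync.replicas, the leader rejects the produce request outright
// instead of accepting a write it cannot adequately replicate.
public class WriteGate {
    public static String handleWrite(int isrSize, int minInsyncReplicas) {
        if (isrSize < minInsyncReplicas) {
            return "NOT_ENOUGH_REPLICAS"; // rejected, nothing is written
        }
        return "OK"; // accepted once all ISR members acknowledge
    }
}
```

With replication factor 3 and min.insync.replicas=2, one broker failure leaves an ISR of 2 and writes still succeed; a second failure shrinks the ISR to 1 and writes are refused until a replica catches back up.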
Common configuration for durability:
replication.factor=3
min.insync.replicas=2
acks=all
Performance Tuning Questions
Optimizing Kafka for your specific workload requires understanding the throughput-latency trade-off.
How do you optimize Kafka for high throughput?
High throughput optimization focuses on maximizing messages per second through larger batches, compression, and parallel processing. Larger batch sizes with increased linger time allow producers to accumulate more messages before sending, reducing per-message overhead. Compression (especially LZ4 or zstd) reduces network bandwidth at the cost of CPU. Consumer-side tuning focuses on fetching more data per request and processing larger batches.
These settings increase latency since messages wait longer before being sent or fetched, so they're appropriate for batch processing workloads rather than real-time applications.
Producer tuning:
```properties
# Larger batches
batch.size=65536
linger.ms=10

# Compression
compression.type=lz4

# More in-flight requests (safe with idempotence enabled)
max.in.flight.requests.per.connection=5

# Larger buffer
buffer.memory=67108864
```

Consumer tuning:
```properties
# Fetch more data per request
fetch.min.bytes=1048576
fetch.max.wait.ms=500

# Larger fetch size per partition
max.partition.fetch.bytes=1048576

# Process in batches
max.poll.records=500
```

Broker tuning:
```properties
# More I/O threads
num.io.threads=16
num.network.threads=8

# Socket buffers
socket.send.buffer.bytes=1048576
socket.receive.buffer.bytes=1048576
```

How do you optimize Kafka for low latency?
Low latency optimization prioritizes fast message delivery over throughput efficiency. Setting linger.ms=0 sends messages immediately without waiting for batching. Consumer-side, fetch.max.wait.ms=0 ensures brokers return data immediately rather than waiting for more messages to accumulate. These settings reduce the time between a producer sending a message and a consumer receiving it, but result in more network requests and lower overall throughput.
For the lowest latency, also consider using SSD storage on brokers and ensuring adequate network bandwidth.
```properties
# Producer: send immediately
linger.ms=0
batch.size=16384

# Consumer: fetch frequently
fetch.max.wait.ms=0
fetch.min.bytes=1

# Broker: flush every message (careful: severe throughput cost)
log.flush.interval.messages=1
```

Trade-off: Lower latency typically means lower throughput.
Monitoring and Operations Questions
Operational excellence is key to running Kafka in production.
What metrics should you monitor for Kafka?
Effective Kafka monitoring covers brokers, producers, and consumers. Broker metrics track cluster health—under-replicated partitions indicate replication problems, while offline partitions mean data unavailability. Producer metrics reveal send performance and errors. Consumer metrics, especially lag, show whether consumers are keeping up with the data flow.
Set up alerts on critical metrics like under-replicated partitions, offline partitions, and consumer lag exceeding thresholds to catch problems before they impact applications.
Broker metrics:
- `kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec` - Message rate
- `kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions` - Replication health
- `kafka.controller:type=KafkaController,name=OfflinePartitionsCount` - Availability
- `kafka.network:type=RequestMetrics,name=RequestsPerSec` - Request rate
Producer metrics:
- `record-send-rate` - Messages sent per second
- `record-error-rate` - Failed sends
- `request-latency-avg` - Average request latency
- `batch-size-avg` - Average batch size
Consumer metrics:
- `records-consumed-rate` - Consumption rate
- `records-lag-max` - Maximum lag across partitions
- `commit-latency-avg` - Offset commit latency
- `rebalance-rate-and-time` - Rebalance frequency
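These broker metrics are exposed over JMX, so they can be addressed from plain Java with the standard javax.management API. A minimal sketch (only ObjectName construction is shown — reading the attribute value requires an MBeanServerConnection to a live broker, which is omitted here):

```java
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

public class KafkaJmx {
    // Build the JMX ObjectName for the broker's message-rate metric.
    // Reading the actual value would go through an MBeanServerConnection
    // to a running broker, e.g. getAttribute(name, "OneMinuteRate").
    public static ObjectName messagesInPerSec() {
        try {
            return new ObjectName(
                "kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec");
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```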
How do you detect and troubleshoot consumer lag?
Consumer lag is the difference between the latest offset in a partition and the consumer's committed offset, representing how far behind the consumer is from real-time. Persistent or growing lag indicates consumers can't keep up with the rate of incoming messages. Diagnosing the cause requires examining whether processing is slow, there are too few consumers, or external factors like database writes or network issues are creating bottlenecks.
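The lag arithmetic itself is simple and can be sketched as a pure function (an illustrative helper, not a Kafka client API):

```java
import java.util.Map;

public class ConsumerLag {
    // Lag for one partition: log-end offset minus committed offset,
    // clamped at zero (a consumer cannot be "ahead" of the log).
    public static long lag(long logEndOffset, long committedOffset) {
        return Math.max(0L, logEndOffset - committedOffset);
    }

    // Total lag across partitions, keyed by partition number; a partition
    // with no committed offset is treated as lagging from offset 0.
    public static long totalLag(Map<Integer, Long> endOffsets,
                                Map<Integer, Long> committed) {
        long total = 0L;
        for (Map.Entry<Integer, Long> e : endOffsets.entrySet()) {
            total += lag(e.getValue(), committed.getOrDefault(e.getKey(), 0L));
        }
        return total;
    }
}
```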
```bash
# Check consumer group lag
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group my-consumer-group

# Output:
# TOPIC   PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
# orders  0          1000            1500            500
# orders  1          2000            2100            100
```

Causes of consumer lag:
- Slow processing: Optimize consumer logic or increase parallelism
- Insufficient consumers: Add more consumers (up to partition count)
- Network issues: Check connectivity and bandwidth
- Rebalancing storms: Use cooperative rebalancing, increase session.timeout.ms
- GC pauses: Tune JVM garbage collection
What is Schema Registry and why is it important?
Schema Registry provides centralized schema management for Kafka, storing schemas for message keys and values and enforcing compatibility rules during evolution. Without Schema Registry, producers and consumers must coordinate schema changes manually, risking deserialization failures when schemas don't match. With Schema Registry, schemas are versioned and checked for compatibility before being registered, preventing breaking changes.
The registry supports multiple serialization formats including Avro, Protobuf, and JSON Schema, with Avro being the most common choice due to its compact binary format and strong schema evolution support.
```java
// Producer with Avro and Schema Registry
props.put("schema.registry.url", "http://localhost:8081");
props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");

// Consumer
props.put("value.deserializer", "io.confluent.kafka.serializers.KafkaAvroDeserializer");
props.put("specific.avro.reader", "true");
```

Compatibility modes:
- BACKWARD: New schema can read old data
- FORWARD: Old schema can read new data
- FULL: Both backward and forward compatible
- NONE: No compatibility checks
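The intuition behind the BACKWARD rule can be illustrated with a deliberately simplified toy model (this is not the full Avro schema-resolution algorithm; the record type and helper below are invented for this sketch):

```java
import java.util.List;
import java.util.Set;

public class BackwardCompat {
    // Toy model of a schema field: a name plus whether a default is declared.
    public record Field(String name, boolean hasDefault) {}

    // Simplified BACKWARD check: the new schema can still read old data
    // as long as every field it adds declares a default, so the new reader
    // can fill in values that old records never wrote.
    public static boolean backwardCompatible(Set<String> oldFieldNames,
                                             List<Field> newFields) {
        for (Field f : newFields) {
            if (!oldFieldNames.contains(f.name()) && !f.hasDefault()) {
                return false;
            }
        }
        return true;
    }
}
```

Adding an optional field with a default passes; adding a required field without one is the classic breaking change Schema Registry rejects.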
What are common Kafka production issues and their solutions?
Production Kafka clusters encounter predictable categories of issues that operators should be prepared to diagnose and resolve. Under-replicated partitions typically indicate broker problems—check disk I/O, network connectivity, and JVM health. Consumer lag requires analyzing whether the bottleneck is processing speed, consumer count, or external dependencies. Rebalancing storms often result from unstable consumers with heartbeat timeouts; increasing session timeouts and using cooperative rebalancing helps.
| Issue | Symptoms | Solution |
|---|---|---|
| Under-replicated partitions | Alerts, slow writes | Check broker health, network, disk I/O |
| Consumer lag | Growing lag metrics | Scale consumers, optimize processing |
| Rebalancing storms | Frequent rebalances, high latency | Increase timeouts, use cooperative assignor |
| Disk full | Write failures | Add retention policies, expand storage |
| Uneven partition distribution | Some brokers overloaded | Run partition reassignment |
| Producer timeouts | Send failures | Check broker health, tune timeouts |
Quick Reference
| Topic | Key Points |
|---|---|
| Architecture | Brokers, topics, partitions, replicas, consumer groups |
| Ordering | Guaranteed within partition only |
| Delivery guarantees | acks=0/1/all, idempotent producer, transactions |
| Consumer groups | Load balancing, rebalancing, offset management |
| Exactly-once | Idempotent + transactional producer + read_committed |
| Kafka Streams | Library, KStream vs KTable, windowing |
| Replication | ISR, min.insync.replicas, leader election |
| Monitoring | Lag, under-replicated partitions, throughput |
Related Articles
- Microservices Architecture Interview Guide - Kafka in microservices context
- System Design Interview Guide - Distributed systems fundamentals
- Spring Boot Interview Guide - Spring Kafka integration
- Complete Java Backend Developer Interview Guide - Full Java backend preparation
