45+ Apache Kafka Interview Questions 2025: Streams, Partitions & Spring Integration

·26 min read
Tags: kafka · event-streaming · java · microservices · distributed-systems · interview-preparation

Apache Kafka has become the backbone of modern event-driven architectures. Whether you're building real-time data pipelines, implementing microservices communication, or processing millions of events per second, Kafka knowledge is essential for backend and data engineering roles.

This guide covers the Kafka concepts that interviewers actually ask about—from core fundamentals to production operational knowledge.

Table of Contents

  1. Kafka Architecture Questions
  2. Topic and Partition Questions
  3. Producer Questions
  4. Consumer Questions
  5. Exactly-Once Semantics Questions
  6. Kafka Streams Questions
  7. ksqlDB Questions
  8. Spring Kafka Questions
  9. Replication and Reliability Questions
  10. Performance Tuning Questions
  11. Monitoring and Operations Questions

Kafka Architecture Questions

Understanding Kafka's architecture is the foundation for all advanced topics.

What are the main components of Kafka architecture?

Kafka's architecture consists of several interconnected components that work together to provide distributed, fault-tolerant event streaming. At its core, a Kafka cluster contains multiple brokers that store data and serve client requests. Topics organize messages into logical categories, while partitions enable horizontal scaling and parallel processing.

Each partition has a leader replica that handles all reads and writes, with follower replicas providing fault tolerance. Consumer groups coordinate multiple consumers to process messages in parallel while ensuring each message is delivered to only one consumer within the group.

flowchart TB
    subgraph cluster["Kafka Cluster"]
        direction LR
        subgraph B1["Broker 1"]
            P0L["Topic A Part 0<br/>(Leader)"]
            P1R["Topic A Part 1<br/>(Replica)"]
        end
        subgraph B2["Broker 2"]
            P1L["Topic A Part 1<br/>(Leader)"]
            P0R1["Topic A Part 0<br/>(Replica)"]
        end
        subgraph B3["Broker 3"]
            P2L["Topic A Part 2<br/>(Leader)"]
            P0R2["Topic A Part 0<br/>(Replica)"]
        end
    end
 
    Producer["Producer"] --> cluster
    cluster --> CG["Consumer<br/>Group"]

Key components:

  • Broker: A Kafka server that stores data and serves clients
  • Topic: A category/feed name to which records are published
  • Partition: An ordered, immutable sequence of records within a topic
  • Replica: A copy of a partition for fault tolerance
  • Leader: The replica that handles all reads and writes for a partition
  • Consumer Group: A set of consumers that cooperate to consume a topic

What is Zookeeper's role in Kafka and what is KRaft?

Zookeeper traditionally served as Kafka's coordination service, managing cluster membership, topic configuration, partition metadata, controller election, and access control lists. However, this dependency added operational complexity and limited Kafka's scalability to around 200,000 partitions per cluster.

KRaft (Kafka Raft) is the new metadata management system that eliminates the Zookeeper dependency entirely. It uses a Raft-based consensus protocol where some brokers act as controllers, storing metadata in an internal Kafka topic. This simplification became production-ready in Kafka 3.3.

# KRaft mode configuration (Kafka 3.3+)
process.roles=broker,controller
node.id=1
controller.quorum.voters=1@localhost:9093,2@localhost:9094,3@localhost:9095

Benefits of KRaft:

  • Simplified operations (one system instead of two)
  • Better scalability (millions of partitions)
  • Faster controller failover
  • Reduced operational complexity

Interview tip: Know that KRaft is production-ready as of Kafka 3.3; Zookeeper mode is deprecated (as of Kafka 3.5) and removed entirely in Kafka 4.0.


Topic and Partition Questions

Topics and partitions are fundamental to understanding how Kafka scales and maintains ordering guarantees.

What is a Kafka topic and how does it differ from a traditional message queue?

A Kafka topic is a logical channel for publishing and subscribing to streams of records, functioning as the primary abstraction for organizing data. Unlike traditional message queues where messages are deleted after consumption, Kafka retains messages based on configurable retention policies, allowing multiple consumers to read the same data and enabling replay from any point in time.

This fundamental difference in data retention model makes Kafka suitable for use cases like event sourcing, audit logging, and stream processing where historical data access is valuable.
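The retention model above can be sketched in a few lines. This is an illustrative simulation (class and method names are mine, not Kafka APIs): the broker periodically drops records older than retention.ms, and consumption never deletes anything.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch of time-based retention: records age out of the log
// based on retention.ms, independently of whether anyone has consumed them.
public class RetentionSketch {
    record Record(long timestampMs, String value) {}

    // Drop records older than retentionMs; returns how many were removed
    static int prune(Deque<Record> log, long nowMs, long retentionMs) {
        int dropped = 0;
        while (!log.isEmpty() && nowMs - log.peekFirst().timestampMs() > retentionMs) {
            log.pollFirst();  // oldest records expire first; consumers are not involved
            dropped++;
        }
        return dropped;
    }

    public static void main(String[] args) {
        Deque<Record> log = new ArrayDeque<>();
        log.add(new Record(1_000, "old"));
        log.add(new Record(90_000, "recent"));
        // With a 60 s retention window at t=100 s, only the old record expires
        System.out.println(prune(log, 100_000, 60_000));
    }
}
```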

| Aspect | Traditional Queue | Kafka Topic |
|---|---|---|
| Message retention | Deleted after consumption | Retained based on policy |
| Multiple consumers | Message goes to one consumer | All consumer groups get all messages |
| Replay capability | Not possible | Can replay from any offset |
| Ordering | Queue-wide (typically) | Per-partition only |
| Scaling | Limited | Horizontal via partitions |

How do partitions affect message ordering?

Kafka guarantees message ordering only within a single partition, not across an entire topic. When a producer sends messages with the same key, Kafka hashes the key to determine the partition, ensuring all messages with that key go to the same partition and are therefore ordered. Messages with different keys may go to different partitions and can be processed in any order relative to each other.

This design enables parallel processing while maintaining ordering where it matters—typically by using entity identifiers as keys so that all events for a single entity are processed in order.

// Messages with same key go to same partition (ordered)
producer.send(new ProducerRecord<>("orders", "customer-123", order1));
producer.send(new ProducerRecord<>("orders", "customer-123", order2));
// order1 always processed before order2 for customer-123
 
// Messages with different keys may go to different partitions
producer.send(new ProducerRecord<>("orders", "customer-456", order3));
// order3 may be processed before order1 or order2

Partition assignment formula (default):

partition = hash(key) % numberOfPartitions
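The formula can be made concrete with a small sketch. Note this is a simplification: Kafka's default partitioner actually applies murmur2 to the serialized key bytes, so the simple hash below is an illustrative stand-in, not Kafka's real hash function.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch of key-based partition assignment. The real default
// partitioner uses murmur2 over the key bytes; the polynomial hash here
// just demonstrates the "same key, same partition" property.
public class PartitionSketch {

    static int partitionFor(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        int hash = 0;
        for (byte b : keyBytes) {
            hash = 31 * hash + b;  // simple stand-in for murmur2
        }
        return (hash & Integer.MAX_VALUE) % numPartitions;  // mask keeps it non-negative
    }

    public static void main(String[] args) {
        // The same key always maps to the same partition, preserving per-key ordering
        System.out.println(partitionFor("customer-123", 6));
        System.out.println(partitionFor("customer-123", 6));
    }
}
```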

How many partitions should a topic have?

Choosing the right partition count requires balancing several factors. More partitions enable greater parallelism since each partition can be consumed by a separate consumer, but they also increase broker memory usage, extend recovery time after failures, and can add end-to-end latency due to more leader elections and replication overhead.

A common starting point is to take the larger of two numbers: your expected throughput divided by ~10 MB/s (a typical per-partition write throughput), and your desired number of parallel consumers. Monitor actual performance and adjust based on observed throughput, latency, and resource utilization.

Considerations:

  • Parallelism: More partitions = more parallel consumers
  • Throughput: Each partition can handle ~10 MB/s writes
  • Latency: More partitions = more end-to-end latency
  • Memory: Each partition uses broker memory
  • Recovery time: More partitions = longer recovery after broker failure
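The starting-point heuristic described above reduces to simple arithmetic. This sketch is illustrative (the class and figures are mine; ~10 MB/s per partition is the rough rule of thumb from the text, not a guarantee):

```java
// Illustrative partition-count heuristic: take the larger of
// (target throughput / per-partition throughput) and desired consumer parallelism.
public class PartitionCountEstimate {

    static int estimatePartitions(double targetMBps, double perPartitionMBps, int desiredConsumers) {
        int forThroughput = (int) Math.ceil(targetMBps / perPartitionMBps);
        return Math.max(forThroughput, desiredConsumers);
    }

    public static void main(String[] args) {
        // 50 MB/s target at ~10 MB/s per partition needs 5 partitions,
        // but 8 desired parallel consumers dominates, so start with 8
        System.out.println(estimatePartitions(50, 10, 8));
    }
}
```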

Producer Questions

Understanding producer configuration is critical for reliable message delivery.

What are the different acks settings and their trade-offs?

The acks (acknowledgments) setting controls how many partition replicas must confirm receipt before the producer considers a write successful. This is the primary knob for trading off between durability and latency. With acks=0, the producer doesn't wait for any confirmation—fastest but may lose messages. With acks=1, only the leader must acknowledge, which can still lose data if the leader fails before replication. With acks=all, all in-sync replicas must acknowledge, providing the strongest durability guarantee.

For most production use cases, acks=all combined with min.insync.replicas=2 provides the right balance of durability without sacrificing too much performance.

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
 
// acks=0: Fire and forget (fastest, least reliable)
props.put("acks", "0");
 
// acks=1: Leader acknowledgment (balanced)
props.put("acks", "1");
 
// acks=all: All in-sync replicas acknowledge (slowest, most reliable)
props.put("acks", "all");

| Setting | Behavior | Durability | Latency |
|---|---|---|---|
| acks=0 | Don't wait for acknowledgment | May lose messages | Lowest |
| acks=1 | Wait for leader to write | May lose if leader fails before replication | Medium |
| acks=all | Wait for all ISR to write | No loss if ISR > 1 | Highest |

What is an idempotent producer and why is it important?

An idempotent producer ensures exactly-once delivery to a single partition even when retries occur due to network issues or broker failures. Without idempotence, a retry after a timeout could result in duplicate messages if the original write actually succeeded but the acknowledgment was lost. The idempotent producer solves this by assigning each producer instance a unique Producer ID and each message a sequence number, allowing brokers to detect and deduplicate retried messages.

Enabling idempotence is straightforward and has minimal performance overhead, making it a best practice for all production deployments where message duplication would be problematic.

props.put("enable.idempotence", "true");
// Requires (and uses by default):
// - acks=all
// - retries=Integer.MAX_VALUE
// - max.in.flight.requests.per.connection <= 5

How it works:

  1. Producer gets a unique Producer ID (PID) on initialization
  2. Each message gets a sequence number
  3. Broker deduplicates based on PID + sequence number
  4. Retried messages with same sequence are ignored
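The dedup steps above can be sketched as follows. This is a deliberate simplification of the broker's real per-partition bookkeeping (class and method names are illustrative): track the highest sequence number seen per producer ID and reject anything at or below it.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of broker-side deduplication for the idempotent producer:
// a retry carrying an already-seen (PID, sequence) pair is dropped rather than
// appended a second time.
public class DedupSketch {
    private final Map<Long, Integer> lastSeqByPid = new HashMap<>();

    // Returns true if the write is appended, false if it is a duplicate retry
    boolean accept(long producerId, int sequence) {
        Integer last = lastSeqByPid.get(producerId);
        if (last != null && sequence <= last) {
            return false;  // duplicate: already persisted, so only re-acknowledge
        }
        lastSeqByPid.put(producerId, sequence);
        return true;
    }

    public static void main(String[] args) {
        DedupSketch broker = new DedupSketch();
        System.out.println(broker.accept(42L, 0));  // first write: accepted
        System.out.println(broker.accept(42L, 0));  // retry of seq 0: deduplicated
    }
}
```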

How does Kafka batching work and how do you tune it?

Kafka producers batch multiple messages together before sending them to brokers, improving throughput by reducing network round trips and enabling more efficient compression. The batch.size parameter sets the maximum batch size in bytes, while linger.ms controls how long the producer waits for additional messages before sending a partially-filled batch.

With linger.ms=0 (the default), messages are sent immediately without waiting for batching. Increasing this value allows more messages to accumulate, improving throughput at the cost of higher latency. Compression further improves efficiency by reducing network bandwidth and storage requirements at the cost of CPU overhead.

// Batch size in bytes
props.put("batch.size", 16384);  // 16 KB default
 
// Maximum wait time to fill batch
props.put("linger.ms", 5);  // Wait up to 5ms for more messages
 
// Compression
props.put("compression.type", "lz4");  // Options: none, gzip, snappy, lz4, zstd

Batching trade-offs:

  • Larger batches: Better throughput, higher latency
  • Smaller batches: Lower latency, more overhead
  • Compression: Reduces network/storage but adds CPU overhead

How do you control which partition a message goes to?

Kafka provides three mechanisms for controlling partition assignment. The default behavior uses key-based partitioning where the hash of the message key determines the partition—this ensures all messages with the same key go to the same partition. For special cases, you can explicitly specify a partition number or implement a custom partitioner that applies business logic to route messages.

Custom partitioners are useful for scenarios like routing high-priority messages to dedicated partitions or implementing geographic routing based on message content.

// 1. Key-based (default): hash(key) % partitions
producer.send(new ProducerRecord<>("topic", "key", "value"));
 
// 2. Explicit partition
producer.send(new ProducerRecord<>("topic", 2, "key", "value"));
 
// 3. Custom partitioner
public class CustomPartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);

        // Route VIP customers to dedicated partition
        if (key != null && key.toString().startsWith("vip-")) {
            return 0;  // VIP partition
        }

        // Default: hash-based distribution across the remaining partitions.
        // Masking with Integer.MAX_VALUE avoids the Math.abs(Integer.MIN_VALUE)
        // overflow pitfall and handles null keys via Objects.hashCode.
        return (Objects.hashCode(key) & Integer.MAX_VALUE) % (partitions.size() - 1) + 1;
    }

    @Override
    public void configure(Map<String, ?> configs) {}

    @Override
    public void close() {}
}

Consumer Questions

Consumer configuration affects reliability, throughput, and exactly-once semantics.

How do consumer groups enable scalable consumption?

Consumer groups provide Kafka's mechanism for parallel message processing with automatic load balancing. When multiple consumers join the same group, Kafka distributes partitions among them so each partition is assigned to exactly one consumer. This enables horizontal scaling—add more consumers to process messages faster, up to the number of partitions.

Different consumer groups operate independently, each maintaining their own offset position and receiving all messages from subscribed topics. This enables the publish-subscribe pattern where multiple applications can independently consume the same data stream.

flowchart TB
    subgraph topic["Topic: orders (3 partitions)"]
        direction LR
        P0["Part 0"]
        P1["Part 1"]
        P2["Part 2"]
    end
 
    subgraph groupA["Consumer Group A"]
        direction LR
        C1A["Consumer 1<br/>Part 0"]
        C2A["Consumer 2<br/>Part 1, 2"]
    end
 
    subgraph groupB["Consumer Group B"]
        C1B["Consumer 1<br/>Part 0, 1, 2"]
    end
 
    P0 --> C1A
    P1 --> C2A
    P2 --> C2A
 
    P0 --> C1B
    P1 --> C1B
    P2 --> C1B

Key rules:

  • Each partition assigned to exactly one consumer per group
  • Multiple groups can consume same topic independently
  • Max effective consumers = number of partitions
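The "one partition, one consumer" rule above can be sketched with a toy assignor. This is illustrative only (names are mine); Kafka's real assignors (range, sticky, cooperative-sticky) are considerably more involved.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative round-robin partition assignment: each partition lands on
// exactly one consumer, and no consumer beyond the partition count gets work.
public class AssignmentSketch {

    static Map<String, List<Integer>> assign(List<String> consumers, int numPartitions) {
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String c : consumers) assignment.put(c, new ArrayList<>());
        for (int p = 0; p < numPartitions; p++) {
            // Deal partitions out like cards: partition p goes to consumer p mod N
            assignment.get(consumers.get(p % consumers.size())).add(p);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // 3 partitions, 2 consumers: one consumer necessarily owns two partitions
        System.out.println(assign(List.of("c1", "c2"), 3));
    }
}
```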

How do consumers track their position with offsets?

Consumers track their progress through each partition using offsets—sequential numbers that identify each message's position. Kafka stores committed offsets in an internal topic (__consumer_offsets), allowing consumers to resume from where they left off after restarts. With auto-commit enabled, offsets are committed periodically in the background, providing at-least-once semantics since a crash between processing and commit can result in reprocessing.

Manual commit provides finer control, allowing you to commit offsets only after successfully processing messages, which is essential for exactly-once processing patterns.

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "my-consumer-group");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
 
// Auto commit (default, at-least-once)
props.put("enable.auto.commit", "true");
props.put("auto.commit.interval.ms", "5000");
 
// Manual commit for more control
props.put("enable.auto.commit", "false");
 
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList("orders"));
 
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        processRecord(record);
    }
    // Synchronous commit (blocks until confirmed)
    consumer.commitSync();

    // Or, as an alternative, async commit (non-blocking):
    // consumer.commitAsync((offsets, exception) -> {
    //     if (exception != null) {
    //         log.error("Commit failed", exception);
    //     }
    // });
}

What happens when a new consumer joins a group?

When a consumer joins or leaves a group, Kafka triggers a rebalance to redistribute partitions among active consumers. During rebalance, all consumers in the group temporarily stop processing while the group coordinator reassigns partitions. After reassignment, consumers resume from their last committed offsets, which may result in some message reprocessing if offsets weren't committed before the rebalance.

Cooperative (incremental) rebalancing, introduced in Kafka 2.4, minimizes disruption by only reassigning affected partitions rather than stopping all consumers.

// Enable cooperative rebalancing
props.put("partition.assignment.strategy",
    "org.apache.kafka.clients.consumer.CooperativeStickyAssignor");

Rebalance strategies:

  • Eager: Stop all, reassign all (the long-standing default)
  • Cooperative (Incremental): Only reassign affected partitions
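The eager-vs-cooperative difference boils down to which partitions get revoked during a rebalance. A minimal illustrative sketch (names are mine, not the real assignor API):

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative contrast of eager vs cooperative rebalancing: eager revokes
// every owned partition before reassigning; cooperative revokes only the
// partitions that are actually moving to another consumer.
public class RebalanceSketch {

    static Set<Integer> revoked(Set<Integer> owned, Set<Integer> newAssignment, boolean cooperative) {
        if (!cooperative) {
            return new HashSet<>(owned);  // eager: stop-the-world, give up everything
        }
        Set<Integer> moving = new HashSet<>(owned);
        moving.removeAll(newAssignment);  // cooperative: only partitions leaving this consumer
        return moving;
    }

    public static void main(String[] args) {
        Set<Integer> owned = Set.of(0, 1, 2);
        Set<Integer> next = Set.of(0, 1);  // partition 2 moves to a new consumer
        System.out.println(revoked(owned, next, false));  // all three revoked
        System.out.println(revoked(owned, next, true));   // only partition 2 revoked
    }
}
```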

Exactly-Once Semantics Questions

Exactly-once processing is critical for applications where duplicate or lost messages cause business problems.

How do you achieve exactly-once processing with Kafka?

Exactly-once semantics in Kafka operates at three levels. The idempotent producer prevents duplicates within a single partition during retries. Transactional producers enable atomic writes across multiple partitions—either all messages in a transaction are visible to consumers, or none are. End-to-end exactly-once combines both with read_committed isolation level to create read-process-write patterns where consuming, processing, and producing are atomic.

Each level builds on the previous one, with the full exactly-once pattern requiring coordination between consumer offsets and producer transactions.

Level 1: Idempotent producer (producer → Kafka)

props.put("enable.idempotence", "true");

Level 2: Transactional producer (across partitions)

props.put("transactional.id", "my-transactional-id");
 
KafkaProducer<String, String> producer = new KafkaProducer<>(props);
producer.initTransactions();
 
try {
    producer.beginTransaction();
    producer.send(new ProducerRecord<>("topic1", "key", "value1"));
    producer.send(new ProducerRecord<>("topic2", "key", "value2"));
    producer.commitTransaction();
} catch (Exception e) {
    producer.abortTransaction();
}

Level 3: Read-process-write (end-to-end)

// Consumer reads with isolation
props.put("isolation.level", "read_committed");
 
// Process and produce atomically
producer.beginTransaction();
for (ConsumerRecord<String, String> record : records) {
    ProducerRecord<String, String> output = process(record);
    producer.send(output);
}
// Commit consumer offsets as part of transaction
producer.sendOffsetsToTransaction(offsets, consumerGroupId);
producer.commitTransaction();

Kafka Streams Questions

Stream processing allows real-time transformations on Kafka data.

What is Kafka Streams and how does it differ from other stream processors?

Kafka Streams is a lightweight Java library for building stream processing applications, not a separate cluster infrastructure like Apache Flink or Spark Streaming. Applications built with Kafka Streams run as standard JVM applications—you simply start multiple instances to scale horizontally, and Kafka handles partition assignment and rebalancing automatically. This eliminates the operational overhead of managing a separate stream processing cluster.

The library provides exactly-once processing semantics, automatic state management with fault-tolerant local stores, and seamless integration with Kafka's partitioning model for parallel processing.

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-processor");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
 
StreamsBuilder builder = new StreamsBuilder();
 
// Read from topic as stream
KStream<String, Order> orders = builder.stream("orders");
 
// Transform: filter, map, flatMap
KStream<String, Order> highValueOrders = orders
    .filter((key, order) -> order.getAmount() > 1000)
    .mapValues(order -> enrichOrder(order));
 
// Write to output topic
highValueOrders.to("high-value-orders");
 
KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();

Advantages:

  • No separate cluster needed
  • Exactly-once processing built-in
  • Elastic scaling (just start more instances)
  • Fault-tolerant with automatic recovery

What is the difference between KStream and KTable?

KStream and KTable represent two different ways of interpreting data in Kafka Streams. A KStream treats each record as an independent event in an append-only log—every record matters, even multiple records with the same key. A KTable interprets records as updates to a changelog, maintaining only the latest value for each key as current state.

This distinction is crucial for joins: joining a stream with a table enriches events with the latest lookup value, while joining two streams correlates events that occur within a time window.

// KStream: Append-only stream of events
// Each record is independent
KStream<String, PageView> pageViews = builder.stream("page-views");
// Records: (user1, view1), (user1, view2), (user2, view1)
 
// KTable: Changelog stream, latest value per key
// Represents current state
KTable<String, UserProfile> users = builder.table("users");
// State: {user1: profile1, user2: profile2}
 
// Join stream with table (enrichment)
KStream<String, EnrichedPageView> enriched = pageViews.join(
    users,
    (pageView, profile) -> new EnrichedPageView(pageView, profile)
);

| Aspect | KStream | KTable |
|---|---|---|
| Semantics | Event stream | Changelog / state |
| Records | All events | Latest per key |
| Memory | Stateless | Stores state |
| Operations | map, filter, flatMap, join | aggregate, reduce, join |

How do you perform windowed aggregations in Kafka Streams?

Windowed aggregations group events by time intervals, essential for computing metrics like "orders per hour" or "page views in the last 5 minutes." Kafka Streams supports three window types: tumbling windows are fixed-size, non-overlapping intervals; hopping windows are fixed-size but overlap; and session windows group events by activity periods separated by gaps of inactivity.

Each window type serves different use cases—tumbling for hourly reports, hopping for moving averages, and session for user activity analysis.

KStream<String, Transaction> transactions = builder.stream("transactions");
 
// Tumbling window: fixed, non-overlapping
KTable<Windowed<String>, Long> hourlyCounts = transactions
    .groupByKey()
    .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofHours(1)))
    .count();
 
// Hopping window: fixed size, overlapping
KTable<Windowed<String>, Long> hoppingCounts = transactions
    .groupByKey()
    .windowedBy(TimeWindows.ofSizeAndGrace(
        Duration.ofMinutes(5),
        Duration.ofMinutes(1)
    ).advanceBy(Duration.ofMinutes(1)))
    .count();
 
// Session window: activity-based, variable size
KTable<Windowed<String>, Long> sessionCounts = transactions
    .groupByKey()
    .windowedBy(SessionWindows.ofInactivityGapWithNoGrace(Duration.ofMinutes(30)))
    .count();

ksqlDB Questions

ksqlDB provides SQL-based stream processing for teams that prefer declarative syntax.

When would you use ksqlDB instead of Kafka Streams?

ksqlDB is a streaming SQL engine that runs as a separate server, allowing you to write stream processing logic using familiar SQL syntax rather than Java code. This makes it ideal for rapid prototyping, simpler transformations, and teams with strong SQL expertise but less Java experience. ksqlDB also provides interactive querying capabilities for exploring streaming data.

However, for complex business logic, custom serialization, or when embedding stream processing directly in an application, Kafka Streams offers more flexibility and control.

-- Create a stream from a topic
CREATE STREAM orders (
    order_id VARCHAR KEY,
    customer_id VARCHAR,
    amount DECIMAL(10,2),
    order_time TIMESTAMP
) WITH (
    KAFKA_TOPIC='orders',
    VALUE_FORMAT='JSON'
);
 
-- Real-time aggregation
CREATE TABLE hourly_revenue AS
SELECT
    customer_id,
    WINDOWSTART as window_start,
    SUM(amount) as total_revenue,
    COUNT(*) as order_count
FROM orders
WINDOW TUMBLING (SIZE 1 HOUR)
GROUP BY customer_id
EMIT CHANGES;
 
-- Join streams
CREATE STREAM enriched_orders AS
SELECT
    o.order_id,
    o.amount,
    c.name as customer_name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
EMIT CHANGES;

Use ksqlDB when:

  • SQL is preferred over Java code
  • Rapid prototyping needed
  • Simpler transformations and aggregations
  • Team has SQL expertise

Use Kafka Streams when:

  • Complex business logic required
  • Need full programmatic control
  • Custom serialization/processing
  • Embedding in existing application

Spring Kafka Questions

Spring Kafka simplifies Kafka integration in Spring Boot applications.

How do you configure Kafka in Spring Boot?

Spring Boot provides auto-configuration for Kafka through application properties, eliminating most boilerplate code. You specify bootstrap servers, serializers, and consumer group settings in YAML or properties files, and Spring creates the necessary producer and consumer factories automatically. For JSON serialization, Spring Kafka provides built-in serializers that handle object-to-JSON conversion with configurable trusted packages for deserialization security.

# application.yml
spring:
  kafka:
    bootstrap-servers: localhost:9092
    producer:
      key-serializer: org.apache.kafka.common.serialization.StringSerializer
      value-serializer: org.springframework.kafka.support.serializer.JsonSerializer
      acks: all
      retries: 3
    consumer:
      group-id: my-group
      key-deserializer: org.apache.kafka.common.serialization.StringDeserializer
      value-deserializer: org.springframework.kafka.support.serializer.JsonDeserializer
      auto-offset-reset: earliest
      properties:
        spring.json.trusted.packages: com.example.dto

How do you produce messages with KafkaTemplate?

KafkaTemplate is Spring Kafka's primary abstraction for sending messages, providing both synchronous and asynchronous sending options. The fire-and-forget approach is simplest but provides no feedback on success or failure. Adding callbacks enables logging and error handling for asynchronous sends, while the synchronous approach blocks until the broker acknowledges receipt—useful when you need guaranteed delivery before proceeding.

@Service
@RequiredArgsConstructor
public class OrderProducer {
 
    private final KafkaTemplate<String, Order> kafkaTemplate;
 
    // Fire and forget
    public void sendOrder(Order order) {
        kafkaTemplate.send("orders", order.getId(), order);
    }
 
    // With callback
    public void sendOrderWithCallback(Order order) {
        CompletableFuture<SendResult<String, Order>> future =
            kafkaTemplate.send("orders", order.getId(), order);
 
        future.whenComplete((result, ex) -> {
            if (ex == null) {
                log.info("Sent order {} to partition {} offset {}",
                    order.getId(),
                    result.getRecordMetadata().partition(),
                    result.getRecordMetadata().offset());
            } else {
                log.error("Failed to send order {}", order.getId(), ex);
            }
        });
    }
 
    // Synchronous (blocking)
    public void sendOrderSync(Order order) throws Exception {
        kafkaTemplate.send("orders", order.getId(), order).get(10, TimeUnit.SECONDS);
    }
}

How do you consume messages with @KafkaListener?

The @KafkaListener annotation declaratively configures message consumption, automatically deserializing messages and invoking the annotated method for each record. You can access message metadata through parameter annotations, process batches for higher throughput, or use manual acknowledgment for precise control over offset commits.

Manual acknowledgment is essential for exactly-once processing patterns where you want to commit offsets only after successful processing.

@Service
@Slf4j
public class OrderConsumer {
 
    // Basic listener
    @KafkaListener(topics = "orders", groupId = "order-processor")
    public void processOrder(Order order) {
        log.info("Received order: {}", order.getId());
        // Process order
    }
 
    // With metadata
    @KafkaListener(topics = "orders", groupId = "order-processor")
    public void processOrderWithMetadata(
            @Payload Order order,
            @Header(KafkaHeaders.RECEIVED_PARTITION) int partition,
            @Header(KafkaHeaders.OFFSET) long offset,
            @Header(KafkaHeaders.RECEIVED_TIMESTAMP) long timestamp) {
        log.info("Received order {} from partition {} at offset {}",
            order.getId(), partition, offset);
    }
 
    // Batch processing
    @KafkaListener(topics = "orders", groupId = "batch-processor",
                   containerFactory = "batchFactory")
    public void processOrderBatch(List<Order> orders) {
        log.info("Received batch of {} orders", orders.size());
        orders.forEach(this::processOrder);
    }
 
    // Manual acknowledgment
    @KafkaListener(topics = "orders", groupId = "manual-ack")
    public void processWithManualAck(Order order, Acknowledgment ack) {
        try {
            processOrder(order);
            ack.acknowledge();  // Commit offset only after successful processing
        } catch (Exception e) {
            // Don't ack - message will be redelivered
            throw e;
        }
    }
}

How do you handle errors and retries in Spring Kafka?

Spring Kafka provides flexible error handling through DefaultErrorHandler with configurable retry policies and dead letter topic publishing. Failed messages can be automatically retried with backoff delays before being sent to a dead letter topic for manual investigation. The @RetryableTopic annotation simplifies this further by automatically creating retry topics with exponential backoff.

@Configuration
public class KafkaConfig {
 
    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, Order> kafkaListenerContainerFactory(
            ConsumerFactory<String, Order> consumerFactory) {
 
        ConcurrentKafkaListenerContainerFactory<String, Order> factory =
            new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory);
 
        // Retry configuration
        factory.setCommonErrorHandler(new DefaultErrorHandler(
            new DeadLetterPublishingRecoverer(kafkaTemplate),
            new FixedBackOff(1000L, 3)  // 3 retries, 1 second apart
        ));
 
        return factory;
    }
}
 
// Or use retry topics (Spring Kafka 2.7+)
@RetryableTopic(
    attempts = "3",
    backoff = @Backoff(delay = 1000, multiplier = 2),
    dltTopicSuffix = ".DLT",
    autoCreateTopics = "true"
)
@KafkaListener(topics = "orders")
public void processWithRetryTopic(Order order) {
    // Exhausted failures are routed through auto-created retry topics, then to orders.DLT
    processOrder(order);
}
 
@DltHandler
public void handleDlt(Order order) {
    log.error("Message exhausted retries: {}", order);
    // Alert, store for manual review, etc.
}

Replication and Reliability Questions

Production Kafka deployments require careful configuration for reliability and fault tolerance.

How does Kafka replication work?

Kafka replication provides fault tolerance by maintaining multiple copies of each partition across different brokers. Each partition has one leader replica that handles all client reads and writes, while follower replicas passively replicate data from the leader. The set of replicas that are caught up with the leader form the In-Sync Replica set (ISR), which determines which replicas are eligible to become leader if the current leader fails.

Followers that fall behind (due to network issues or slow disks) are removed from the ISR until they catch up, ensuring that leader election only promotes fully synchronized replicas.

flowchart LR
    subgraph B1["Broker 1 (Leader)"]
        L1["Message 1"]
        L2["Message 2"]
        L3["Message 3<br/>(committed)"]
    end
 
    subgraph B2["Broker 2 (Follower)"]
        F1["Message 1"]
        F2["Message 2"]
        F3["Message 3<br/>(in ISR)"]
    end
 
    subgraph B3["Broker 3 (Follower)"]
        R1["Message 1"]
        R2["Message 2"]
        R3["(catching up)<br/>(not in ISR)"]
    end
 
    B2 -->|"replicates"| B1
    B3 -.->|"catching up"| B1

In-Sync Replicas (ISR): Replicas that are:

  • Maintaining an active session with the controller (ZooKeeper in legacy clusters, the KRaft controller in newer ones)
  • Fetching from the leader within replica.lag.time.max.ms
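To see the replica set and ISR in practice, a topic can be created and described from the CLI (assuming a local broker on port 9092; topic name is illustrative):

```
# Create a topic with 3 replicas per partition
kafka-topics.sh --bootstrap-server localhost:9092 \
    --create --topic orders --partitions 3 --replication-factor 3

# Show leader, replica set, and ISR for each partition
kafka-topics.sh --bootstrap-server localhost:9092 \
    --describe --topic orders
```

A healthy partition shows an Isr list equal to its Replicas list; a shrinking Isr means a follower has fallen behind.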

What is min.insync.replicas and why is it important?

The min.insync.replicas setting defines the minimum number of replicas that must acknowledge a write when using acks=all. This prevents data loss in scenarios where most replicas are unavailable—if the ISR falls below this threshold, the broker rejects writes with a NOT_ENOUGH_REPLICAS error rather than accepting writes that can't be adequately replicated.

The standard production configuration uses replication factor 3 with min.insync.replicas=2, allowing the cluster to tolerate one broker failure while still accepting writes. If a second broker fails, the partition stops accepting writes (protecting against data loss) but remains readable as long as a leader survives.

# Topic configuration
min.insync.replicas=2

With acks=all and min.insync.replicas=2:

  • Write succeeds only if at least 2 replicas acknowledge
  • If ISR drops below 2, writes are rejected (NOT_ENOUGH_REPLICAS)
  • Prevents data loss even if one broker fails

Common configuration for durability:

replication.factor=3
min.insync.replicas=2
acks=all
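The broker's acceptance rule boils down to a simple predicate (a toy sketch for intuition, not actual broker code):

```java
// Toy sketch of the acks=all acceptance rule -- not broker internals.
public class IsrCheck {
    static boolean acceptWrite(int isrSize, int minInsyncReplicas) {
        // Writes are rejected with NOT_ENOUGH_REPLICAS once the ISR
        // shrinks below min.insync.replicas.
        return isrSize >= minInsyncReplicas;
    }

    public static void main(String[] args) {
        System.out.println(acceptWrite(3, 2));  // true: all replicas in sync
        System.out.println(acceptWrite(2, 2));  // true: one broker down, still writable
        System.out.println(acceptWrite(1, 2));  // false: rejected, NOT_ENOUGH_REPLICAS
    }
}
```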

Performance Tuning Questions

Optimizing Kafka for your specific workload requires understanding the throughput-latency trade-off.

How do you optimize Kafka for high throughput?

High throughput optimization focuses on maximizing messages per second through larger batches, compression, and parallel processing. Larger batch sizes with increased linger time allow producers to accumulate more messages before sending, reducing per-message overhead. Compression (especially LZ4 or zstd) reduces network bandwidth at the cost of CPU. Consumer-side tuning focuses on fetching more data per request and processing larger batches.

These settings increase latency since messages wait longer before being sent or fetched, so they're appropriate for batch processing workloads rather than real-time applications.

Producer tuning:

# Larger batches
batch.size=65536
linger.ms=10
 
# Compression
compression.type=lz4
 
# More in-flight requests (with idempotence)
max.in.flight.requests.per.connection=5
 
# Larger buffer
buffer.memory=67108864

Consumer tuning:

# Fetch more data per request
fetch.min.bytes=1048576
fetch.max.wait.ms=500
 
# Larger fetch size
max.partition.fetch.bytes=1048576
 
# Process in batches
max.poll.records=500

Broker tuning:

# More I/O threads
num.io.threads=16
num.network.threads=8
 
# Socket buffers
socket.send.buffer.bytes=1048576
socket.receive.buffer.bytes=1048576
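The producer settings above translate directly into Java client Properties (bootstrap.servers is an assumption; point it at your cluster):

```java
import java.util.Properties;

// High-throughput producer settings as Java client Properties.
public class ThroughputProducerConfig {
    static Properties build() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");        // assumption: local broker
        props.put("batch.size", "65536");                        // larger batches
        props.put("linger.ms", "10");                            // wait briefly to fill batches
        props.put("compression.type", "lz4");                    // cheap CPU, big bandwidth win
        props.put("enable.idempotence", "true");                 // keeps ordering safe with 5 in-flight
        props.put("max.in.flight.requests.per.connection", "5");
        props.put("buffer.memory", "67108864");                  // 64 MB send buffer
        return props;
    }

    public static void main(String[] args) {
        System.out.println(build().getProperty("compression.type"));  // lz4
    }
}
```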

How do you optimize Kafka for low latency?

Low latency optimization prioritizes fast message delivery over throughput efficiency. Setting linger.ms=0 sends messages immediately without waiting for batching. Consumer-side, fetch.max.wait.ms=0 ensures brokers return data immediately rather than waiting for more messages to accumulate. These settings reduce the time between a producer sending a message and a consumer receiving it, but result in more network requests and lower overall throughput.

For the lowest latency, also consider using SSD storage on brokers and ensuring adequate network bandwidth.

# Producer: Send immediately
linger.ms=0
batch.size=16384
 
# Consumer: Fetch frequently
fetch.max.wait.ms=0
fetch.min.bytes=1
 
# Broker: flush every message (use with care)
log.flush.interval.messages=1  # Per-message fsync maximizes durability but adds latency; most deployments rely on replication and the OS page cache instead

Trade-off: Lower latency typically means lower throughput.


Monitoring and Operations Questions

Operational excellence is key to running Kafka in production.

What metrics should you monitor for Kafka?

Effective Kafka monitoring covers brokers, producers, and consumers. Broker metrics track cluster health—under-replicated partitions indicate replication problems, while offline partitions mean data unavailability. Producer metrics reveal send performance and errors. Consumer metrics, especially lag, show whether consumers are keeping up with the data flow.

Set up alerts on critical metrics like under-replicated partitions, offline partitions, and consumer lag exceeding thresholds to catch problems before they impact applications.

Broker metrics:

  • kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec - Message rate
  • kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions - Replication health
  • kafka.controller:type=KafkaController,name=OfflinePartitionsCount - Availability
  • kafka.network:type=RequestMetrics,name=RequestsPerSec - Request rate
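Two of the broker-side conditions can also be checked ad hoc from the CLI (assuming a local broker on port 9092):

```
# List only partitions whose ISR is smaller than the replica set
kafka-topics.sh --bootstrap-server localhost:9092 \
    --describe --under-replicated-partitions

# List partitions with no active leader
kafka-topics.sh --bootstrap-server localhost:9092 \
    --describe --unavailable-partitions
```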

Producer metrics:

  • record-send-rate - Messages sent per second
  • record-error-rate - Failed sends
  • request-latency-avg - Average request latency
  • batch-size-avg - Average batch size

Consumer metrics:

  • records-consumed-rate - Consumption rate
  • records-lag-max - Maximum lag across partitions
  • commit-latency-avg - Offset commit latency
  • rebalance-rate-and-time - Rebalance frequency

How do you detect and troubleshoot consumer lag?

Consumer lag is the difference between the latest offset in a partition and the consumer's committed offset, representing how far behind the consumer is from real-time. Persistent or growing lag indicates consumers can't keep up with the rate of incoming messages. Diagnosing the cause requires examining whether processing is slow, there are too few consumers, or external factors like database writes or network issues are creating bottlenecks.

# Check consumer group lag
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
    --describe --group my-consumer-group
 
# Output:
# TOPIC    PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
# orders   0          1000            1500            500
# orders   1          2000            2100            100
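The LAG column is simply log-end offset minus committed offset. A minimal sketch of the same arithmetic (hypothetical helper, not part of the Kafka client API):

```java
import java.util.Map;

// Per-partition lag = log-end offset minus the consumer's committed offset,
// matching the LAG column from kafka-consumer-groups.sh. Illustrative only.
public class LagCalculator {
    static long lag(long logEndOffset, long committedOffset) {
        return Math.max(0L, logEndOffset - committedOffset);
    }

    // Values are {committedOffset, logEndOffset} pairs keyed by partition.
    static long totalLag(Map<Integer, long[]> offsets) {
        return offsets.values().stream().mapToLong(o -> lag(o[1], o[0])).sum();
    }

    public static void main(String[] args) {
        // Numbers from the sample output above: partitions 0 and 1 of "orders"
        Map<Integer, long[]> orders = Map.of(
            0, new long[]{1000, 1500},   // lag 500
            1, new long[]{2000, 2100});  // lag 100
        System.out.println(totalLag(orders));  // 600
    }
}
```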

Causes of consumer lag:

  1. Slow processing: Optimize consumer logic or increase parallelism
  2. Insufficient consumers: Add more consumers (up to partition count)
  3. Network issues: Check connectivity and bandwidth
  4. Rebalancing storms: Use cooperative rebalancing, increase session.timeout.ms
  5. GC pauses: Tune JVM garbage collection
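On point 2, adding consumers only helps up to the partition count, because each partition is owned by exactly one consumer in the group. A toy illustration:

```java
// Each partition is owned by exactly one consumer in a group, so useful
// parallelism is capped at the partition count. Illustrative only.
public class ConsumerScaling {
    static int effectiveParallelism(int consumers, int partitions) {
        return Math.min(consumers, partitions);
    }

    public static void main(String[] args) {
        System.out.println(effectiveParallelism(4, 6));  // 4: adding consumers still helps
        System.out.println(effectiveParallelism(8, 6));  // 6: two consumers sit idle
    }
}
```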

What is Schema Registry and why is it important?

Schema Registry provides centralized schema management for Kafka, storing schemas for message keys and values and enforcing compatibility rules during evolution. Without Schema Registry, producers and consumers must coordinate schema changes manually, risking deserialization failures when schemas don't match. With Schema Registry, schemas are versioned and checked for compatibility before being registered, preventing breaking changes.

The registry supports multiple serialization formats including Avro, Protobuf, and JSON Schema, with Avro being the most common choice due to its compact binary format and strong schema evolution support.

// Producer with Avro values and Schema Registry
props.put("schema.registry.url", "http://localhost:8081");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
 
// Consumer (also needs schema.registry.url)
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "io.confluent.kafka.serializers.KafkaAvroDeserializer");
props.put("specific.avro.reader", "true");  // deserialize into generated SpecificRecord classes

Compatibility modes:

  • BACKWARD: New schema can read old data
  • FORWARD: Old schema can read new data
  • FULL: Both backward and forward compatible
  • NONE: No compatibility checks
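As a concrete example, adding a field with a default value is a BACKWARD-compatible change: consumers on the new schema can still read old records because the default fills in the missing field. A hypothetical v2 of an Order value schema (names are illustrative):

```json
{
  "type": "record",
  "name": "Order",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "currency", "type": "string", "default": "USD"}
  ]
}
```

Adding the same field without a default would be rejected by the registry under BACKWARD mode, since the new schema could not read old records.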

What are common Kafka production issues and their solutions?

Production Kafka clusters encounter predictable categories of issues that operators should be prepared to diagnose and resolve. Under-replicated partitions typically indicate broker problems—check disk I/O, network connectivity, and JVM health. Consumer lag requires analyzing whether the bottleneck is processing speed, consumer count, or external dependencies. Rebalancing storms often result from unstable consumers with heartbeat timeouts; increasing session timeouts and using cooperative rebalancing helps.

| Issue | Symptoms | Solution |
|---|---|---|
| Under-replicated partitions | Alerts, slow writes | Check broker health, network, disk I/O |
| Consumer lag | Growing lag metrics | Scale consumers, optimize processing |
| Rebalancing storms | Frequent rebalances, high latency | Increase timeouts, use cooperative assignor |
| Disk full | Write failures | Add retention policies, expand storage |
| Uneven partition distribution | Some brokers overloaded | Run partition reassignment |
| Producer timeouts | Send failures | Check broker health, tune timeouts |

Quick Reference

| Topic | Key Points |
|---|---|
| Architecture | Brokers, topics, partitions, replicas, consumer groups |
| Ordering | Guaranteed within partition only |
| Delivery guarantees | acks=0/1/all, idempotent producer, transactions |
| Consumer groups | Load balancing, rebalancing, offset management |
| Exactly-once | Idempotent + transactional producer + read_committed |
| Kafka Streams | Library, KStream vs KTable, windowing |
| Replication | ISR, min.insync.replicas, leader election |
| Monitoring | Lag, under-replicated partitions, throughput |

Ready to ace your interview?

Get 550+ interview questions with detailed answers in our comprehensive PDF guides.

View PDF Guides