An order service publishes an event, Kafka accepts it, and a fulfillment service stores the result in its database. That looks like one simple path, but several independent steps must succeed: the producer must serialize the record, choose the correct partition, survive retries, and receive an acknowledgment. The consumer must join the right group, poll the assigned partition, deserialize the record, finish the business operation, and commit the correct offset.
A mistake at any point can produce a failure that is difficult to diagnose. A lost acknowledgment can cause the producer to retry a batch. An offset committed too early can skip an unprocessed order. An offset committed too late can repeat a completed database update. Slow processing can trigger a consumer rebalance, move the partition to another instance, and cause the same records to appear again. A malformed record can stop an entire partition if deserialization errors are not handled deliberately.
The correct design is not a single magic setting. It is an end-to-end agreement between producer behavior, topic partitioning, consumer processing, offset management, and external side effects. This tutorial builds that workflow around an order-processing scenario and explains how to reason about every failure boundary.
The Workflow We Need to Protect
Assume an e-commerce platform contains these components:
CheckoutServiceaccepts an order and publishes lifecycle events.FulfillmentServicereserves products and prepares the shipment.NotificationServicesends customer updates.OrderProjectionServicestores a queryable view for support agents.- A Kafka topic named
order-lifecyclecarries the events.
The expected flow is:
Customer
|
v
CheckoutService
|
| produces order events
v
Kafka topic: order-lifecycle
|
+----------------------+----------------------+
| | |
v v v
FulfillmentService NotificationService OrderProjectionService
|
v
Operational database
Each downstream service must receive every relevant event independently. Therefore, each service uses a different consumer group.
Within one service, several instances may share the same group to process partitions in parallel. Kafka then assigns each partition to one member of that group at a time.
For a single order, the producer may emit this sequence:
ORDER_ACCEPTED
PAYMENT_CONFIRMED
INVENTORY_RESERVED
READY_FOR_SHIPMENT
The design must preserve that sequence for one order while allowing unrelated orders to be processed concurrently.
Start with the Delivery Contract
Before selecting properties, define what the system means by successful delivery.
For this scenario:
- An order event must not be silently lost after
CheckoutServicereports success. - Events for the same order must remain in Kafka partition order.
- Consumer processing may occasionally repeat after failures.
- Downstream operations must tolerate that repetition.
- A consumer must not commit progress before its business operation has completed.
- A temporary consumer outage must not make the group resume from an unexpected location.
This is an at-least-once processing design at the application boundary. Kafka can prevent duplicates introduced by producer retries when idempotence is configured correctly, but the complete business workflow can still repeat. For example, a consumer can update its database and crash before committing the Kafka offset. The record is then read again after restart.
The application must therefore distinguish two questions:
- Did Kafka store the event reliably?
- Did the consumer complete the business side effect exactly once?
Those questions are related, but they are not the same.
Producer Stage 1: Create a Stable Record Contract
Kafka brokers treat keys and values as bytes. Producers serialize application data before sending it, and consumers must use matching deserializers.
The producer and consumer teams must agree on:
- The topic name
- The key format
- The value format
- Relevant headers
- The meaning of timestamps
- How incompatible records are handled
For the order pipeline, the order identifier is the natural key:
Key:
ORD-84721
Value:
type = PAYMENT_CONFIRMED
orderId = ORD-84721
status = PAID
occurredAt = 2026-06-20T15:30:00Z
Headers:
event-version = 1
trace-id = c94a72
The key is not only metadata. It determines partition placement when key-based partitioning is used.
A simple producer configuration can express matching serializers:
bootstrap.servers=kafka-a:9092,kafka-b:9092,kafka-c:9092
key.serializer=org.apache.kafka.common.serialization.StringSerializer
value.serializer=org.apache.kafka.common.serialization.StringSerializer
client.id=checkout-producer
The consumer must configure the matching deserializers:
bootstrap.servers=kafka-a:9092,kafka-b:9092,kafka-c:9092
key.deserializer=org.apache.kafka.common.serialization.StringDeserializer
value.deserializer=org.apache.kafka.common.serialization.StringDeserializer
Kafka does not inspect the business structure and reject a mismatched payload on behalf of the application. If the producer writes a format the consumer cannot interpret, the failure appears at consumption time.
Producer Stage 2: Preserve Per-Order Ordering with the Key
A Kafka topic is split into partitions. Ordering exists inside a partition, not across the entire topic.
When the producer sends every event for ORD-84721 with the same key, key-based partitioning routes those records to the same partition. Their order can then be preserved by Kafka.
Partition 0: ORD-10002, ORD-33419, ORD-10002
Partition 1: ORD-84721, ORD-84721, ORD-84721
Partition 2: ORD-55291, ORD-71903
This gives the platform two useful properties:
- Events for one order stay together.
- Different orders can be processed in parallel on different partitions.
Using a random event identifier as the key would spread traffic, but related events could land in different partitions. That would remove the per-order ordering guarantee.
Using a value with very poor distribution can create a hot partition. For example, partitioning only by country can overload one partition when most orders come from one region. A good key identifies the ordering boundary and distributes records across the topic reasonably well.
A producer call using the order identifier as the key can be expressed with the native Java client:
var record = new ProducerRecord<String, String>(
"order-lifecycle",
orderId,
serializedEvent
);
producer.send(record, (metadata, error) -> {
if (error != null) {
deliveryFailureHandler.accept(orderId, error);
return;
}
deliverySuccessHandler.accept(
orderId,
metadata.partition(),
metadata.offset()
);
});
The callback is important because the send operation is normally asynchronous. Passing a record to the client library does not by itself prove that Kafka stored it.
Producer Stage 3: Understand Batching
Kafka producers collect records in memory and organize them into batches per topic partition.
CheckoutService memory
Batch for partition 0:
ORD-10002
ORD-33419
Batch for partition 1:
ORD-84721
ORD-84721
Batch for partition 2:
ORD-55291
Batching improves throughput because the client sends fewer network requests. It also means:
- Memory is used while records wait to be sent or acknowledged.
- Acknowledgments apply to batches.
- Retriable failures can cause a complete batch to be sent again.
- Larger batches usually favor throughput.
- Waiting longer for records to accumulate can increase latency.
The important producer settings include:
batch.size=<tested-batch-size>
linger.ms=<tested-wait-time>
buffer.memory=<tested-buffer-limit>
delivery.timeout.ms=<end-to-end-delivery-limit>
request.timeout.ms=<single-request-limit>
max.block.ms=<maximum-buffer-wait>
These values should be chosen together. Increasing the accumulation delay may create fuller batches, but it also delays low-volume events. Increasing the buffer allows more pending data, but it also increases memory requirements and the amount of unsent data that can disappear if the process terminates.
Do not tune these settings independently from message size, production rate, acknowledgment latency, and the application's memory limit.
Producer Stage 4: Require a Meaningful Acknowledgment
Kafka producers can use different acknowledgment modes.
No acknowledgment
With acks=0, the producer does not wait for broker confirmation. This reduces waiting, but the application cannot know whether the cluster stored the record.
That is unsuitable for critical order state.
Leader acknowledgment
With acks=1, the partition leader acknowledges after accepting the data. The record may not yet be replicated to follower replicas. A leader failure at the wrong time can therefore lose data that the producer considered delivered.
In-sync replica acknowledgment
With acks=all, the leader acknowledges after the write satisfies the in-sync replica requirement. The strength of this setting depends on the topic or broker value of min.insync.replicas.
A practical producer baseline for important events is:
acks=all
enable.idempotence=true
The cluster must also have a suitable replication policy. If min.insync.replicas allows only the leader to be enough, acks=all does not provide the same protection as requiring multiple in-sync copies.
Producer and topic settings must therefore be reviewed as one reliability policy.
Producer Stage 5: Make Retries Safe
A common ambiguous failure looks like this:
1. Producer sends a batch.
2. Kafka stores and replicates it.
3. Kafka returns an acknowledgment.
4. The acknowledgment is lost.
5. Producer reaches a timeout.
6. Producer retries the batch.
The producer cannot distinguish "Kafka rejected the batch" from "Kafka accepted the batch but the response disappeared."
Retries are necessary, but they must not create duplicate Kafka records. Producer idempotence enables Kafka to recognize retry attempts from the same producer session and avoid appending the same sequence again.
The related settings must remain compatible:
enable.idempotence=true
acks=all
retries=<allow-retriable-recovery>
max.in.flight.requests.per.connection=<ordering-safe-value>
The exact values should follow the client version's compatibility rules. The architectural requirement is that retries, acknowledgments, and in-flight requests must not be configured in a combination that disables idempotence or weakens ordering.
Producer idempotence solves retry duplication inside Kafka. It does not prevent an application from intentionally sending the same logical event twice in separate calls. It also does not make database changes in a consumer globally exactly once.
Producer Stage 6: Handle Metadata and Broker Changes
A producer does not permanently send all records to one bootstrap address. Bootstrap servers are initial contact points. The client retrieves cluster metadata, learns which broker leads each partition, and caches that information.
If a partition leader changes, a send can temporarily reach the old leader. The broker returns an error, the producer refreshes metadata, and the client retries against the new leader.
This is normally automatic, but it affects troubleshooting. A brief producer log message about stale leadership does not always mean the application is broken. Repeated metadata failures, however, can indicate:
- Unresolvable advertised broker addresses
- Blocked network paths
- Incorrect listener configuration
- Unstable brokers
- A client connected through the wrong listener
The addresses returned through advertised.listeners must be reachable from the client environment.
Producer Stage 7: Compress at the Producer When It Helps
Kafka can compress an entire producer batch before sending it. The broker can retain that compressed representation and send it to consumers, where the client decompresses it automatically.
A producer can select a supported codec:
compression.type=zstd
Other supported choices include gzip, snappy, and lz4. The right choice depends on CPU cost, network throughput, storage requirements, data format, and latency goals.
Compression is often effective for repeated text structures because similar fields appear throughout a batch. It usually helps less when the payload is already compact or encrypted.
Keeping broker compression configured to use the producer's codec avoids unnecessary decompression and recompression on the broker.
The Complete Producer Path
The production sequence now looks like this:
Business operation creates event
|
v
Build key, value, and headers
|
v
Serialize key and value
|
v
Use order ID to select partition
|
v
Append record to partition batch
|
v
Optionally compress the batch
|
v
Send batch to partition leader
|
v
Replicate according to cluster policy
|
v
Receive acknowledgment
|
+-----+------+
| |
success retriable error
| |
v v
complete retry safely
callback with idempotence
Only after successful completion should the producer-side application treat the Kafka delivery as accepted.
Consumer Stage 1: Use Consumer Groups Correctly
A consumer group represents one logical subscriber.
Suppose three services must independently receive every order event:
FulfillmentService group.id=fulfillment
NotificationService group.id=notifications
OrderProjectionService group.id=order-projection
Because the group IDs differ, each service group receives the complete topic stream.
Within FulfillmentService, several instances can share the same group:
order-lifecycle partitions
Partition 0 -> Fulfillment instance A
Partition 1 -> Fulfillment instance B
Partition 2 -> Fulfillment instance C
Partition 3 -> Fulfillment instance A
Each partition is assigned to one member of the group at a time. This protects partition order while allowing different partitions to be consumed concurrently.
If two business services accidentally use the same group ID, they do not both receive every event. They divide the partitions between them as if they were instances of the same subscriber. This is a serious configuration mistake.
Consumer Stage 2: Configure the Starting Position Deliberately
The consumer group stores committed offsets. An offset identifies the group's processed position in a partition.
When a group has no committed offset, auto.offset.reset decides where consumption begins.
For a projection that must rebuild all available order state, a suitable configuration is:
group.id=order-projection
auto.offset.reset=earliest
enable.auto.commit=false
earliest tells the consumer to start from the earliest retained record when no valid committed offset exists.
A notification service that should only react to new events might make a different decision. The important point is that the choice must match the business function.
Offsets also have a retention lifecycle. If a group remains inactive long enough for its stored offsets to expire, auto.offset.reset becomes relevant again. The offset retention policy should therefore be compatible with the maximum expected consumer outage.
Consumer Stage 3: Poll, Deserialize, Process, and Commit
The main consumer workflow is:
Join consumer group
|
v
Receive partition assignment
|
v
Poll Kafka for records
|
v
Deserialize keys and values
|
v
Run business processing
|
v
Commit completed offsets
|
v
Poll again
A simplified Java consumer loop is:
consumer.subscribe(List.of("order-lifecycle"));
while (running) {
var records = consumer.poll(Duration.ofMillis(pollWaitMillis));
for (var record : records) {
processOrderEvent(
record.key(),
record.value(),
record.headers()
);
}
consumer.commitSync();
}
This example commits only after the returned records have been processed. Production code must also define what happens when one record fails in the middle of a batch. Blindly committing the whole batch after partial success can skip failed work.
The application needs an explicit policy for:
- Retriable business failures
- Permanently invalid records
- Deserialization errors
- Partial batch completion
- Shutdown while records are in progress
- Partition revocation during a rebalance
Consumer Stage 4: Commit Offsets After the Side Effect
Automatic offset commits simplify code, but they can advance the group's position before processing is safely complete.
Consider this failure:
1. Consumer polls offsets 140 through 149.
2. Automatic commit records progress through 149.
3. Consumer processes offsets 140 through 144.
4. Process crashes before handling 145 through 149.
5. Consumer restarts after offset 149.
Offsets 145 through 149 can be skipped because Kafka believes the group has already moved beyond them.
For critical processing, disable automatic commits:
enable.auto.commit=false
Then commit only after the corresponding work has succeeded.
This produces a different failure possibility:
1. Consumer reads offset 145.
2. Database update succeeds.
3. Consumer crashes before committing offset 145.
4. Consumer restarts and reads offset 145 again.
The event is repeated, but it is not lost. This is normally the safer failure mode for critical business data, provided the database operation can tolerate repetition.
The External Database Consistency Gap
Kafka offset commits and database transactions are separate operations. Without a shared distributed transaction, the consumer cannot make both changes indivisible.
There are two unsafe orders:
| Operation order | Failure risk |
|---|---|
| Commit Kafka offset, then update database | Database failure can cause lost processing |
| Update database, then commit Kafka offset | Commit failure can cause repeated processing |
The second order is usually preferable when loss is unacceptable:
Read Kafka record
|
v
Apply database change
|
v
Commit Kafka offset
A repeated event must not corrupt the database. The consumer should design updates to be idempotent where possible. Practical approaches include checking the current entity version, recognizing an already applied event, or making the state transition safe to repeat.
The important architecture rule is:
Assume a record can be delivered to business logic more than once.
Producer idempotence does not remove this consumer-side repetition window.
Consumer Stage 5: Size Poll Batches Around Processing Time
Kafka consumers receive batches. Several properties influence how much data a fetch can return:
fetch.min.bytescontrols how much data the broker should try to accumulate before replying.fetch.max.wait.mslimits how long the broker waits when the minimum has not been reached.fetch.max.byteslimits the total fetch response size.max.partition.fetch.byteslimits data per partition.max.poll.recordslimits how many records are returned to the application in one poll.
Larger batches can improve throughput by reducing request overhead. They also require more memory and more processing time before the next poll.
The settings must match the consumer workload. A service that performs a quick in-memory update can handle larger polls than a service that calls a slow external system for every record.
A useful tuning process is:
- Measure average and worst-case record processing time.
- Measure record size distribution.
- Select a poll size the consumer can complete inside its poll interval.
- Verify memory use under backlog conditions.
- Measure consumer lag.
- Adjust one related group of settings at a time.
- Repeat with realistic failures and traffic.
Consumer Stage 6: Prevent Slow Processing from Triggering Rebalances
Kafka uses heartbeats and polling behavior to determine whether a consumer is healthy.
Important timeout concepts include:
heartbeat.interval.ms: how frequently the consumer reports that it is alive.session.timeout.ms: how long the group coordinator waits before treating a silent member as failed.max.poll.interval.ms: the maximum allowed time between poll calls.
If processing a batch takes longer than max.poll.interval.ms, the consumer may be removed from the group. Kafka then reassigns its partitions.
That can create this sequence:
Consumer A polls a large batch
|
v
Processing takes too long
|
v
Consumer A exceeds poll interval
|
v
Group rebalances
|
v
Partition moves to Consumer B
|
v
Consumer B reads from last committed offset
If Consumer A had completed some work without committing it, Consumer B can process that work again.
The solution is not simply to increase every timeout. Instead:
- Reduce
max.poll.recordswhen a batch takes too long. - Avoid unbounded external calls inside the poll loop.
- Benchmark worst-case processing time.
- Keep the polling and processing model understandable.
- Increase
max.poll.interval.msonly when long processing is a legitimate requirement. - Monitor rebalance frequency and consumer lag.
Consumer Stage 7: Understand Rebalancing
A rebalance occurs when the membership or partition assignment of a consumer group changes.
Common triggers include:
- A new consumer joins.
- A consumer shuts down.
- A consumer stops heartbeating.
- A consumer exceeds the allowed poll interval.
- Topic partition availability changes.
During a rebalance, partitions can move between instances. The application must be prepared to stop work on revoked partitions, commit safe progress, and continue from the correct position after assignment.
Kafka supports several assignment approaches.
Range assignment
Range assignment groups partition indexes in contiguous ranges. It can be useful when consumers subscribe to several related topics and partitions with the same index should stay on the same consumer.
Round-robin assignment
Round-robin assignment distributes partitions more evenly across consumers, but a rebalance can move many assignments.
Sticky assignment
Sticky assignment attempts to keep existing ownership while still balancing the group. This reduces unnecessary movement.
Cooperative sticky assignment
Cooperative sticky assignment moves only the partitions that need to change instead of revoking every assignment at once. This reduces disruption during incremental rebalancing.
The correct strategy depends on whether the system prioritizes co-partitioning, even distribution, or minimal movement during membership changes.
Newer KRaft-based clusters can also use a newer consumer rebalance protocol in which assignment decisions move to the server side and the classic group leader role is removed. Teams should verify client, broker, and cluster-mode support before adopting it.
Consumer Stage 8: Reduce Restart Rebalances with Static Membership
A routine deployment can cause an instance to leave and quickly rejoin the same group. Without a stable identity, Kafka may treat it as a new member and trigger partition movement.
Static membership assigns a stable group.instance.id to each consumer instance:
group.id=fulfillment
group.instance.id=fulfillment-instance-a
When the same instance returns within the allowed session window, the coordinator can recognize it and retain its previous assignment.
Every active instance must have a unique static identifier. Reusing one identifier for two running instances creates a membership conflict.
Static membership is useful for controlled restarts, but it does not replace correct timeout configuration or rebalance handling.
Consumer Stage 9: Handle Deserialization Failures Explicitly
Kafka brokers do not validate that a payload matches the consumer's expected structure. A malformed or incompatible record can therefore reach the consumer.
Without an error strategy, the consumer may repeatedly fail on the same offset and stop making progress on that partition.
The application needs to decide:
- Can the record be retried?
- Is the payload permanently unreadable?
- Should the record be isolated for investigation?
- Can processing continue after the bad record?
- What metadata is required to diagnose the producer and schema version?
At minimum, failure records should preserve:
- Topic
- Partition
- Offset
- Key
- Producer identity when available
- Trace information
- Deserialization error
- Payload handling status
Do not silently skip malformed business events. Skipping restores throughput but can create hidden data loss.
Consumer Stage 10: Scale with Partitions, Not Only Instances
The maximum active consumers in one group is limited by the number of assigned partitions.
Topic partitions: 4
Consumers in group: 6
Consumer A -> partition 0
Consumer B -> partition 1
Consumer C -> partition 2
Consumer D -> partition 3
Consumer E -> idle
Consumer F -> idle
Adding more service instances does not increase parallelism after every partition already has an owner.
When lag grows, investigate in this order:
- Is the producer rate higher than the consumer processing rate?
- Is one partition much busier than the others?
- Is the message key creating uneven distribution?
- Is each record waiting on a slow external dependency?
- Are rebalances interrupting useful work?
- Is the poll batch too large or too small?
- Are there enough partitions for the required parallelism?
- Is the consumer spending too much time retrying failures?
Consumer lag is the difference between the newest available offset and the group's committed progress for each partition. Monitor lag per partition, not only as one total number. A single hot partition can be hidden by an acceptable cluster-wide average.
Parallel Workers Require a Commit Watermark
A consumer may dispatch polled records to a worker pool to increase concurrency. This is dangerous when workers complete out of order.
Suppose one partition returns offsets:
200, 201, 202, 203
Workers finish in this order:
200 completed
202 completed
203 completed
201 still running
The consumer cannot safely commit through 203 because offset 201 is incomplete. It may only advance to the highest contiguous completed position.
Completed: 200, 202, 203
Missing: 201
Safe commit boundary: after 200
Once 201 finishes, the safe boundary can move past 203.
This boundary is often treated as a completion watermark. A worker-pool design must also preserve per-key ordering by routing records with the same key to the same worker or by processing each partition serially.
Adding threads without a precise offset algorithm can create silent loss.
Full End-to-End Failure Analysis
The following table summarizes the most important failure windows.
| Failure point | Possible result | Required protection |
|---|---|---|
| Producer fails before send | Event never reaches Kafka | Application-level recovery or durable publishing workflow |
| Broker stores batch but acknowledgment is lost | Producer retries | Producer idempotence |
| Weak acknowledgment before replication | Acknowledged data can disappear with leader failure | acks=all with suitable in-sync replica policy |
| Consumer commits before processing | Unprocessed event can be skipped | Manual commit after success |
| Database update succeeds before offset commit | Event is processed again | Idempotent consumer operation |
| Consumer processes too slowly | Rebalance and repeated work | Poll sizing and timeout alignment |
| Consumer cannot deserialize one record | Partition can stop progressing | Explicit error handling |
| Group offsets expire | Consumer uses reset policy | Offset retention aligned with outage expectations |
| More consumers than partitions | Extra instances remain idle | Partition and capacity planning |
| Uneven record keys | Hot partition and rising lag | Better partition-key design |
A Practical End-to-End Configuration Baseline
The following producer configuration expresses the important intent without pretending that one set of tuning values fits every system:
bootstrap.servers=kafka-a:9092,kafka-b:9092,kafka-c:9092
client.id=checkout-producer
key.serializer=org.apache.kafka.common.serialization.StringSerializer
value.serializer=org.apache.kafka.common.serialization.StringSerializer
acks=all
enable.idempotence=true
compression.type=zstd
batch.size=<measured-value>
linger.ms=<measured-value>
buffer.memory=<measured-value>
delivery.timeout.ms=<measured-value>
request.timeout.ms=<measured-value>
The consumer baseline is:
bootstrap.servers=kafka-a:9092,kafka-b:9092,kafka-c:9092
client.id=fulfillment-consumer
group.id=fulfillment
key.deserializer=org.apache.kafka.common.serialization.StringDeserializer
value.deserializer=org.apache.kafka.common.serialization.StringDeserializer
enable.auto.commit=false
auto.offset.reset=earliest
max.poll.records=<measured-value>
max.poll.interval.ms=<measured-value>
session.timeout.ms=<measured-value>
heartbeat.interval.ms=<measured-value>
These are not independent switches. The final values must be validated against:
- Record size
- Event rate
- Number of partitions
- Consumer processing cost
- External dependency latency
- Memory limits
- Required recovery time
- Acceptable consumer lag
- Broker and network behavior during failure
Testing the Complete Workflow
A reliable pipeline must be tested under ambiguity, not only under success.
Test 1: Producer acknowledgment loss
- Publish records continuously.
- Introduce a failure that interrupts acknowledgments.
- Allow the producer to retry.
- Verify that retry handling does not append duplicate Kafka records.
- Confirm that application callbacks report final success or failure correctly.
Test 2: Partition ordering
- Publish several state changes using the same order ID.
- Publish unrelated orders concurrently.
- Verify that one order's events remain in partition order.
- Verify that unrelated orders can be processed in parallel.
Test 3: Consumer crash after database success
- Process an event and complete the database update.
- Stop the process before committing the offset.
- Restart the consumer.
- Verify that the record is delivered again.
- Confirm that repeated processing leaves the database correct.
Test 4: Consumer crash after offset commit
This test should demonstrate why committing early is unsafe.
- Commit progress before applying the database update.
- Stop the consumer.
- Restart it.
- Verify that the unprocessed record is no longer returned.
- Remove the unsafe commit order from the implementation.
Test 5: Slow processing and rebalance
- Make event handling slower than the allowed poll interval.
- Observe the consumer leave the group.
- Confirm that its partitions move to another instance.
- Measure duplicate work and lag.
- Reduce poll size or adjust the processing model.
Test 6: Malformed payload
- Publish a record that cannot be deserialized.
- Verify that the error is captured with topic, partition, and offset.
- Confirm that the handling policy is applied.
- Ensure one bad record does not create an unexplained permanent stall.
Test 7: New consumer group
- Create a group with no committed offsets.
- Start it with
auto.offset.reset=earliest. - Verify that it reads retained history.
- Repeat with a different reset policy and confirm the intended behavior.
Test 8: Scale beyond partition count
- Start one consumer per partition.
- Add more consumers to the same group.
- Verify that the extra members remain idle.
- Confirm that capacity planning is based on partitions and processing rate.
Common Mistakes
Treating send as immediate success
The producer API is asynchronous. Success must come from the completion result, not merely from calling the method.
Selecting a key without considering ordering
The key defines which records stay together. A random identifier can break entity ordering, while a low-cardinality key can overload one partition.
Using acks=all without reviewing replica requirements
The acknowledgment setting depends on the cluster's in-sync replica policy. The producer and topic must be designed together.
Assuming producer idempotence makes the whole workflow exactly once
It protects against duplicates caused by producer retries. It does not make an external database update and a Kafka offset commit atomic.
Sharing a consumer group between independent services
Members of one group split partitions. Independent subscribers require different group IDs.
Keeping automatic commits for critical work
Automatic progress can move ahead of unfinished processing. Manual commit after success is safer when loss is unacceptable.
Increasing consumer count without checking partitions
Consumers beyond the number of partitions do not add useful parallelism within the same group.
Increasing timeouts instead of fixing slow processing
Long timeouts can hide an overloaded or blocked consumer. Poll size, processing design, and external calls should be measured first.
Ignoring malformed records
A single incompatible record can repeatedly block a partition. Error handling must be part of the consumer design.
Monitoring only service health
A running consumer can still be far behind. Monitor partition-level lag, rebalances, processing failures, memory, poll duration, and producer delivery errors.
Delivery Checklist
Before releasing the producer:
- The topic and record contract are documented.
- The entity key preserves the required ordering boundary.
- Serializers match the agreed format.
- Send completion is observed asynchronously.
- Acknowledgments and replica requirements form one durability policy.
- Producer idempotence remains enabled.
- Retry and in-flight settings preserve ordering.
- Batch, timeout, and memory values have been tested.
- Broker addresses returned to the client are reachable.
- Compression has been evaluated with realistic payloads.
Before releasing the consumer:
- Each logical subscriber has its own group ID.
- The reset policy matches the service's recovery purpose.
- Automatic commits are disabled for critical processing.
- Offsets are committed only after successful work.
- Database operations tolerate repeated delivery.
- Poll size fits inside the allowed poll interval.
- Heartbeat, session, and poll timeouts are aligned.
- Rebalance behavior has been tested.
- Deserialization failures have an explicit handling path.
- Offset retention supports the maximum expected outage.
- Consumer lag is monitored per partition.
- Scaling expectations match the number of partitions.
Conclusion
A Kafka producer-consumer pipeline is a chain of independent durability and progress decisions. The producer must serialize correctly, select a stable key, batch efficiently, request meaningful acknowledgments, and retry without duplicating records. The consumer must join the correct group, process only its assigned partitions, handle rebalances, finish business work before committing offsets, and tolerate repeated delivery around external side effects.
The safest mental model is not "Kafka delivers each business action once." It is "Kafka preserves records and positions, while applications define when publishing and processing are complete."
When producer reliability, partitioning, consumer groups, offset commits, processing time, and database behavior are designed together, failures become recoverable and explainable instead of silent and destructive.