Preventing Lost, Duplicate, and Out-of-Order Orders Across a Kafka Producer-Consumer Pipeline

An order service publishes an event, Kafka accepts it, and a fulfillment service stores the result in its database. That looks like one simple path, but several independent steps must succeed: the producer must serialize the record, choose the correct partition, survive retries, and receive an acknowledgment. The consumer must join the right group, poll the assigned partition, deserialize the record, finish the business operation, and commit the correct offset.

A mistake at any point can produce a failure that is difficult to diagnose. A lost acknowledgment can cause the producer to retry a batch. An offset committed too early can skip an unprocessed order. An offset committed too late can repeat a completed database update. Slow processing can trigger a consumer rebalance, move the partition to another instance, and cause the same records to appear again. A malformed record can stop an entire partition if deserialization errors are not handled deliberately.

The correct design is not a single magic setting. It is an end-to-end agreement between producer behavior, topic partitioning, consumer processing, offset management, and external side effects. This tutorial builds that workflow around an order-processing scenario and explains how to reason about every failure boundary.

The Workflow We Need to Protect

Assume an e-commerce platform contains these components:

CheckoutService accepts an order and publishes lifecycle events.
FulfillmentService reserves products and prepares the shipment.
NotificationService sends customer updates.
OrderProjectionService stores a queryable view for support agents.
A Kafka topic named order-lifecycle carries the events.

The expected flow is:

Customer
   |
   v
CheckoutService
   |
   | produces order events
   v
Kafka topic: order-lifecycle
   |
   +----------------------+----------------------+
   |                      |                      |
   v                      v                      v
FulfillmentService  NotificationService  OrderProjectionService
   |
   v
Operational database

Each downstream service must receive every relevant event independently. Therefore, each service uses a different consumer group.

Within one service, several instances may share the same group to process partitions in parallel. Kafka then assigns each partition to one member of that group at a time.

For a single order, the producer may emit this sequence:

ORDER_ACCEPTED
PAYMENT_CONFIRMED
INVENTORY_RESERVED
READY_FOR_SHIPMENT

The design must preserve that sequence for one order while allowing unrelated orders to be processed concurrently.

Start with the Delivery Contract

Before selecting properties, define what the system means by successful delivery.

For this scenario:

An order event must not be silently lost after CheckoutService reports success.
Events for the same order must remain in Kafka partition order.
Consumer processing may occasionally repeat after failures.
Downstream operations must tolerate that repetition.
A consumer must not commit progress before its business operation has completed.
A temporary consumer outage must not make the group resume from an unexpected location.

This is an at-least-once processing design at the application boundary. Kafka can prevent duplicates introduced by producer retries when idempotence is configured correctly, but the complete business workflow can still repeat. For example, a consumer can update its database and crash before committing the Kafka offset. The record is then read again after restart.

The application must therefore distinguish two questions:

Did Kafka store the event reliably?
Did the consumer complete the business side effect exactly once?

Those questions are related, but they are not the same.

Producer Stage 1: Create a Stable Record Contract

Kafka brokers treat keys and values as bytes. Producers serialize application data before sending it, and consumers must use matching deserializers.

The producer and consumer teams must agree on:

The topic name
The key format
The value format
Relevant headers
The meaning of timestamps
How incompatible records are handled

For the order pipeline, the order identifier is the natural key:

Key:
ORD-84721

Value:
type = PAYMENT_CONFIRMED
orderId = ORD-84721
status = PAID
occurredAt = 2026-06-20T15:30:00Z

Headers:
event-version = 1
trace-id = c94a72

The key is not only metadata. It determines partition placement when key-based partitioning is used.

A simple producer configuration can express matching serializers:

bootstrap.servers=kafka-a:9092,kafka-b:9092,kafka-c:9092
key.serializer=org.apache.kafka.common.serialization.StringSerializer
value.serializer=org.apache.kafka.common.serialization.StringSerializer
client.id=checkout-producer

The consumer must configure the matching deserializers:

bootstrap.servers=kafka-a:9092,kafka-b:9092,kafka-c:9092
key.deserializer=org.apache.kafka.common.serialization.StringDeserializer
value.deserializer=org.apache.kafka.common.serialization.StringDeserializer

Kafka does not inspect the business structure and reject a mismatched payload on behalf of the application. If the producer writes a format the consumer cannot interpret, the failure appears at consumption time.

Producer Stage 2: Preserve Per-Order Ordering with the Key

A Kafka topic is split into partitions. Ordering exists inside a partition, not across the entire topic.

When the producer sends every event for ORD-84721 with the same key, key-based partitioning routes those records to the same partition. Their order can then be preserved by Kafka.

Partition 0: ORD-10002, ORD-33419, ORD-10002
Partition 1: ORD-84721, ORD-84721, ORD-84721
Partition 2: ORD-55291, ORD-71903

This gives the platform two useful properties:

Events for one order stay together.
Different orders can be processed in parallel on different partitions.

Using a random event identifier as the key would spread traffic, but related events could land in different partitions. That would remove the per-order ordering guarantee.

Using a value with very poor distribution can create a hot partition. For example, partitioning only by country can overload one partition when most orders come from one region. A good key identifies the ordering boundary and distributes records across the topic reasonably well.

A producer call using the order identifier as the key can be expressed with the native Java client:

var record = new ProducerRecord<String, String>(
        "order-lifecycle",
        orderId,
        serializedEvent
);

producer.send(record, (metadata, error) -> {
    if (error != null) {
        deliveryFailureHandler.accept(orderId, error);
        return;
    }

    deliverySuccessHandler.accept(
            orderId,
            metadata.partition(),
            metadata.offset()
    );
});

The callback is important because the send operation is normally asynchronous. Passing a record to the client library does not by itself prove that Kafka stored it.

Producer Stage 3: Understand Batching

Kafka producers collect records in memory and organize them into batches per topic partition.

CheckoutService memory

Batch for partition 0:
  ORD-10002
  ORD-33419

Batch for partition 1:
  ORD-84721
  ORD-84721

Batch for partition 2:
  ORD-55291

Batching improves throughput because the client sends fewer network requests. It also means:

Memory is used while records wait to be sent or acknowledged.
Acknowledgments apply to batches.
Retriable failures can cause a complete batch to be sent again.
Larger batches usually favor throughput.
Waiting longer for records to accumulate can increase latency.

The important producer settings include:

batch.size=<tested-batch-size>
linger.ms=<tested-wait-time>
buffer.memory=<tested-buffer-limit>
delivery.timeout.ms=<end-to-end-delivery-limit>
request.timeout.ms=<single-request-limit>
max.block.ms=<maximum-buffer-wait>

These values should be chosen together. Increasing the accumulation delay may create fuller batches, but it also delays low-volume events. Increasing the buffer allows more pending data, but it also increases memory requirements and the amount of unsent data that can disappear if the process terminates.

Do not tune these settings independently from message size, production rate, acknowledgment latency, and the application's memory limit.

Producer Stage 4: Require a Meaningful Acknowledgment

Kafka producers can use different acknowledgment modes.

No acknowledgment

With acks=0, the producer does not wait for broker confirmation. This reduces waiting, but the application cannot know whether the cluster stored the record.

That is unsuitable for critical order state.

Leader acknowledgment

With acks=1, the partition leader acknowledges after accepting the data. The record may not yet be replicated to follower replicas. A leader failure at the wrong time can therefore lose data that the producer considered delivered.

In-sync replica acknowledgment

With acks=all, the leader acknowledges after the write satisfies the in-sync replica requirement. The strength of this setting depends on the topic or broker value of min.insync.replicas.

A practical producer baseline for important events is:

acks=all
enable.idempotence=true

The cluster must also have a suitable replication policy. If min.insync.replicas allows only the leader to be enough, acks=all does not provide the same protection as requiring multiple in-sync copies.

Producer and topic settings must therefore be reviewed as one reliability policy.

Producer Stage 5: Make Retries Safe

A common ambiguous failure looks like this:

1. Producer sends a batch.
2. Kafka stores and replicates it.
3. Kafka returns an acknowledgment.
4. The acknowledgment is lost.
5. Producer reaches a timeout.
6. Producer retries the batch.

The producer cannot distinguish "Kafka rejected the batch" from "Kafka accepted the batch but the response disappeared."

Retries are necessary, but they must not create duplicate Kafka records. Producer idempotence enables Kafka to recognize retry attempts from the same producer session and avoid appending the same sequence again.

The related settings must remain compatible:

enable.idempotence=true
acks=all
retries=<allow-retriable-recovery>
max.in.flight.requests.per.connection=<ordering-safe-value>

The exact values should follow the client version's compatibility rules. The architectural requirement is that retries, acknowledgments, and in-flight requests must not be configured in a combination that disables idempotence or weakens ordering.

Producer idempotence solves retry duplication inside Kafka. It does not prevent an application from intentionally sending the same logical event twice in separate calls. It also does not make database changes in a consumer globally exactly once.

Producer Stage 6: Handle Metadata and Broker Changes

A producer does not permanently send all records to one bootstrap address. Bootstrap servers are initial contact points. The client retrieves cluster metadata, learns which broker leads each partition, and caches that information.

If a partition leader changes, a send can temporarily reach the old leader. The broker returns an error, the producer refreshes metadata, and the client retries against the new leader.

This is normally automatic, but it affects troubleshooting. A brief producer log message about stale leadership does not always mean the application is broken. Repeated metadata failures, however, can indicate:

Unresolvable advertised broker addresses
Blocked network paths
Incorrect listener configuration
Unstable brokers
A client connected through the wrong listener

The addresses returned through advertised.listeners must be reachable from the client environment.

Producer Stage 7: Compress at the Producer When It Helps

Kafka can compress an entire producer batch before sending it. The broker can retain that compressed representation and send it to consumers, where the client decompresses it automatically.

A producer can select a supported codec:

compression.type=zstd

Other supported choices include gzip, snappy, and lz4. The right choice depends on CPU cost, network throughput, storage requirements, data format, and latency goals.

Compression is often effective for repeated text structures because similar fields appear throughout a batch. It usually helps less when the payload is already compact or encrypted.

Keeping broker compression configured to use the producer's codec avoids unnecessary decompression and recompression on the broker.

The Complete Producer Path

The production sequence now looks like this:

Business operation creates event
            |
            v
Build key, value, and headers
            |
            v
Serialize key and value
            |
            v
Use order ID to select partition
            |
            v
Append record to partition batch
            |
            v
Optionally compress the batch
            |
            v
Send batch to partition leader
            |
            v
Replicate according to cluster policy
            |
            v
Receive acknowledgment
            |
      +-----+------+
      |            |
   success      retriable error
      |            |
      v            v
complete       retry safely
callback       with idempotence

Only after successful completion should the producer-side application treat the Kafka delivery as accepted.

Consumer Stage 1: Use Consumer Groups Correctly

A consumer group represents one logical subscriber.

Suppose three services must independently receive every order event:

FulfillmentService     group.id=fulfillment
NotificationService    group.id=notifications
OrderProjectionService group.id=order-projection

Because the group IDs differ, each service group receives the complete topic stream.

Within FulfillmentService, several instances can share the same group:

order-lifecycle partitions

Partition 0 -> Fulfillment instance A
Partition 1 -> Fulfillment instance B
Partition 2 -> Fulfillment instance C
Partition 3 -> Fulfillment instance A

Each partition is assigned to one member of the group at a time. This protects partition order while allowing different partitions to be consumed concurrently.

If two business services accidentally use the same group ID, they do not both receive every event. They divide the partitions between them as if they were instances of the same subscriber. This is a serious configuration mistake.

Consumer Stage 2: Configure the Starting Position Deliberately

The consumer group stores committed offsets. An offset identifies the group's processed position in a partition.

When a group has no committed offset, auto.offset.reset decides where consumption begins.

For a projection that must rebuild all available order state, a suitable configuration is:

group.id=order-projection
auto.offset.reset=earliest
enable.auto.commit=false

earliest tells the consumer to start from the earliest retained record when no valid committed offset exists.

A notification service that should only react to new events might make a different decision. The important point is that the choice must match the business function.

Offsets also have a retention lifecycle. If a group remains inactive long enough for its stored offsets to expire, auto.offset.reset becomes relevant again. The offset retention policy should therefore be compatible with the maximum expected consumer outage.

Consumer Stage 3: Poll, Deserialize, Process, and Commit

The main consumer workflow is:

Join consumer group
        |
        v
Receive partition assignment
        |
        v
Poll Kafka for records
        |
        v
Deserialize keys and values
        |
        v
Run business processing
        |
        v
Commit completed offsets
        |
        v
Poll again

A simplified Java consumer loop is:

consumer.subscribe(List.of("order-lifecycle"));

while (running) {
    var records = consumer.poll(Duration.ofMillis(pollWaitMillis));

    for (var record : records) {
        processOrderEvent(
                record.key(),
                record.value(),
                record.headers()
        );
    }

    consumer.commitSync();
}

This example commits only after the returned records have been processed. Production code must also define what happens when one record fails in the middle of a batch. Blindly committing the whole batch after partial success can skip failed work.

The application needs an explicit policy for:

Retriable business failures
Permanently invalid records
Deserialization errors
Partial batch completion
Shutdown while records are in progress
Partition revocation during a rebalance

Consumer Stage 4: Commit Offsets After the Side Effect

Automatic offset commits simplify code, but they can advance the group's position before processing is safely complete.

Consider this failure:

1. Consumer polls offsets 140 through 149.
2. Automatic commit records progress through 149.
3. Consumer processes offsets 140 through 144.
4. Process crashes before handling 145 through 149.
5. Consumer restarts after offset 149.

Offsets 145 through 149 can be skipped because Kafka believes the group has already moved beyond them.

For critical processing, disable automatic commits:

enable.auto.commit=false

Then commit only after the corresponding work has succeeded.

This produces a different failure possibility:

1. Consumer reads offset 145.
2. Database update succeeds.
3. Consumer crashes before committing offset 145.
4. Consumer restarts and reads offset 145 again.

The event is repeated, but it is not lost. This is normally the safer failure mode for critical business data, provided the database operation can tolerate repetition.

The External Database Consistency Gap

Kafka offset commits and database transactions are separate operations. Without a shared distributed transaction, the consumer cannot make both changes indivisible.

There are two unsafe orders:

Operation order	Failure risk
Commit Kafka offset, then update database	Database failure can cause lost processing
Update database, then commit Kafka offset	Commit failure can cause repeated processing

The second order is usually preferable when loss is unacceptable:

Read Kafka record
      |
      v
Apply database change
      |
      v
Commit Kafka offset

A repeated event must not corrupt the database. The consumer should design updates to be idempotent where possible. Practical approaches include checking the current entity version, recognizing an already applied event, or making the state transition safe to repeat.

The important architecture rule is:

Assume a record can be delivered to business logic more than once.

Producer idempotence does not remove this consumer-side repetition window.

Consumer Stage 5: Size Poll Batches Around Processing Time

Kafka consumers receive batches. Several properties influence how much data a fetch can return:

fetch.min.bytes controls how much data the broker should try to accumulate before replying.
fetch.max.wait.ms limits how long the broker waits when the minimum has not been reached.
fetch.max.bytes limits the total fetch response size.
max.partition.fetch.bytes limits data per partition.
max.poll.records limits how many records are returned to the application in one poll.

Larger batches can improve throughput by reducing request overhead. They also require more memory and more processing time before the next poll.

The settings must match the consumer workload. A service that performs a quick in-memory update can handle larger polls than a service that calls a slow external system for every record.

A useful tuning process is:

Measure average and worst-case record processing time.
Measure record size distribution.
Select a poll size the consumer can complete inside its poll interval.
Verify memory use under backlog conditions.
Measure consumer lag.
Adjust one related group of settings at a time.
Repeat with realistic failures and traffic.

Consumer Stage 6: Prevent Slow Processing from Triggering Rebalances

Kafka uses heartbeats and polling behavior to determine whether a consumer is healthy.

Important timeout concepts include:

heartbeat.interval.ms: how frequently the consumer reports that it is alive.
session.timeout.ms: how long the group coordinator waits before treating a silent member as failed.
max.poll.interval.ms: the maximum allowed time between poll calls.

If processing a batch takes longer than max.poll.interval.ms, the consumer may be removed from the group. Kafka then reassigns its partitions.

That can create this sequence:

Consumer A polls a large batch
        |
        v
Processing takes too long
        |
        v
Consumer A exceeds poll interval
        |
        v
Group rebalances
        |
        v
Partition moves to Consumer B
        |
        v
Consumer B reads from last committed offset

If Consumer A had completed some work without committing it, Consumer B can process that work again.

The solution is not simply to increase every timeout. Instead:

Reduce max.poll.records when a batch takes too long.
Avoid unbounded external calls inside the poll loop.
Benchmark worst-case processing time.
Keep the polling and processing model understandable.
Increase max.poll.interval.ms only when long processing is a legitimate requirement.
Monitor rebalance frequency and consumer lag.

Consumer Stage 7: Understand Rebalancing

A rebalance occurs when the membership or partition assignment of a consumer group changes.

Common triggers include:

A new consumer joins.
A consumer shuts down.
A consumer stops heartbeating.
A consumer exceeds the allowed poll interval.
Topic partition availability changes.

During a rebalance, partitions can move between instances. The application must be prepared to stop work on revoked partitions, commit safe progress, and continue from the correct position after assignment.

Kafka supports several assignment approaches.

Range assignment

Range assignment groups partition indexes in contiguous ranges. It can be useful when consumers subscribe to several related topics and partitions with the same index should stay on the same consumer.

Round-robin assignment

Round-robin assignment distributes partitions more evenly across consumers, but a rebalance can move many assignments.

Sticky assignment

Sticky assignment attempts to keep existing ownership while still balancing the group. This reduces unnecessary movement.

Cooperative sticky assignment

Cooperative sticky assignment moves only the partitions that need to change instead of revoking every assignment at once. This reduces disruption during incremental rebalancing.

The correct strategy depends on whether the system prioritizes co-partitioning, even distribution, or minimal movement during membership changes.

Newer KRaft-based clusters can also use a newer consumer rebalance protocol in which assignment decisions move to the server side and the classic group leader role is removed. Teams should verify client, broker, and cluster-mode support before adopting it.

Consumer Stage 8: Reduce Restart Rebalances with Static Membership

A routine deployment can cause an instance to leave and quickly rejoin the same group. Without a stable identity, Kafka may treat it as a new member and trigger partition movement.

Static membership assigns a stable group.instance.id to each consumer instance:

group.id=fulfillment
group.instance.id=fulfillment-instance-a

When the same instance returns within the allowed session window, the coordinator can recognize it and retain its previous assignment.

Every active instance must have a unique static identifier. Reusing one identifier for two running instances creates a membership conflict.

Static membership is useful for controlled restarts, but it does not replace correct timeout configuration or rebalance handling.

Consumer Stage 9: Handle Deserialization Failures Explicitly

Kafka brokers do not validate that a payload matches the consumer's expected structure. A malformed or incompatible record can therefore reach the consumer.

Without an error strategy, the consumer may repeatedly fail on the same offset and stop making progress on that partition.

The application needs to decide:

Can the record be retried?
Is the payload permanently unreadable?
Should the record be isolated for investigation?
Can processing continue after the bad record?
What metadata is required to diagnose the producer and schema version?

At minimum, failure records should preserve:

Topic
Partition
Offset
Key
Producer identity when available
Trace information
Deserialization error
Payload handling status

Do not silently skip malformed business events. Skipping restores throughput but can create hidden data loss.

Consumer Stage 10: Scale with Partitions, Not Only Instances

The maximum active consumers in one group is limited by the number of assigned partitions.

Topic partitions: 4
Consumers in group: 6

Consumer A -> partition 0
Consumer B -> partition 1
Consumer C -> partition 2
Consumer D -> partition 3
Consumer E -> idle
Consumer F -> idle

Adding more service instances does not increase parallelism after every partition already has an owner.

When lag grows, investigate in this order:

Is the producer rate higher than the consumer processing rate?
Is one partition much busier than the others?
Is the message key creating uneven distribution?
Is each record waiting on a slow external dependency?
Are rebalances interrupting useful work?
Is the poll batch too large or too small?
Are there enough partitions for the required parallelism?
Is the consumer spending too much time retrying failures?

Consumer lag is the difference between the newest available offset and the group's committed progress for each partition. Monitor lag per partition, not only as one total number. A single hot partition can be hidden by an acceptable cluster-wide average.

Parallel Workers Require a Commit Watermark

A consumer may dispatch polled records to a worker pool to increase concurrency. This is dangerous when workers complete out of order.

Suppose one partition returns offsets:

200, 201, 202, 203

Workers finish in this order:

200 completed
202 completed
203 completed
201 still running

The consumer cannot safely commit through 203 because offset 201 is incomplete. It may only advance to the highest contiguous completed position.

Completed: 200, 202, 203
Missing:   201

Safe commit boundary: after 200

Once 201 finishes, the safe boundary can move past 203.

This boundary is often treated as a completion watermark. A worker-pool design must also preserve per-key ordering by routing records with the same key to the same worker or by processing each partition serially.

Adding threads without a precise offset algorithm can create silent loss.

Full End-to-End Failure Analysis

The following table summarizes the most important failure windows.

Failure point	Possible result	Required protection
Producer fails before send	Event never reaches Kafka	Application-level recovery or durable publishing workflow
Broker stores batch but acknowledgment is lost	Producer retries	Producer idempotence
Weak acknowledgment before replication	Acknowledged data can disappear with leader failure	`acks=all` with suitable in-sync replica policy
Consumer commits before processing	Unprocessed event can be skipped	Manual commit after success
Database update succeeds before offset commit	Event is processed again	Idempotent consumer operation
Consumer processes too slowly	Rebalance and repeated work	Poll sizing and timeout alignment
Consumer cannot deserialize one record	Partition can stop progressing	Explicit error handling
Group offsets expire	Consumer uses reset policy	Offset retention aligned with outage expectations
More consumers than partitions	Extra instances remain idle	Partition and capacity planning
Uneven record keys	Hot partition and rising lag	Better partition-key design

A Practical End-to-End Configuration Baseline

The following producer configuration expresses the important intent without pretending that one set of tuning values fits every system:

bootstrap.servers=kafka-a:9092,kafka-b:9092,kafka-c:9092
client.id=checkout-producer

key.serializer=org.apache.kafka.common.serialization.StringSerializer
value.serializer=org.apache.kafka.common.serialization.StringSerializer

acks=all
enable.idempotence=true
compression.type=zstd

batch.size=<measured-value>
linger.ms=<measured-value>
buffer.memory=<measured-value>
delivery.timeout.ms=<measured-value>
request.timeout.ms=<measured-value>

The consumer baseline is:

bootstrap.servers=kafka-a:9092,kafka-b:9092,kafka-c:9092
client.id=fulfillment-consumer
group.id=fulfillment

key.deserializer=org.apache.kafka.common.serialization.StringDeserializer
value.deserializer=org.apache.kafka.common.serialization.StringDeserializer

enable.auto.commit=false
auto.offset.reset=earliest

max.poll.records=<measured-value>
max.poll.interval.ms=<measured-value>
session.timeout.ms=<measured-value>
heartbeat.interval.ms=<measured-value>

These are not independent switches. The final values must be validated against:

Record size
Event rate
Number of partitions
Consumer processing cost
External dependency latency
Memory limits
Required recovery time
Acceptable consumer lag
Broker and network behavior during failure

Testing the Complete Workflow

A reliable pipeline must be tested under ambiguity, not only under success.

Test 1: Producer acknowledgment loss

Publish records continuously.
Introduce a failure that interrupts acknowledgments.
Allow the producer to retry.
Verify that retry handling does not append duplicate Kafka records.
Confirm that application callbacks report final success or failure correctly.

Test 2: Partition ordering

Publish several state changes using the same order ID.
Publish unrelated orders concurrently.
Verify that one order's events remain in partition order.
Verify that unrelated orders can be processed in parallel.

Test 3: Consumer crash after database success

Process an event and complete the database update.
Stop the process before committing the offset.
Restart the consumer.
Verify that the record is delivered again.
Confirm that repeated processing leaves the database correct.

Test 4: Consumer crash after offset commit

This test should demonstrate why committing early is unsafe.

Commit progress before applying the database update.
Stop the consumer.
Restart it.
Verify that the unprocessed record is no longer returned.
Remove the unsafe commit order from the implementation.

Test 5: Slow processing and rebalance

Make event handling slower than the allowed poll interval.
Observe the consumer leave the group.
Confirm that its partitions move to another instance.
Measure duplicate work and lag.
Reduce poll size or adjust the processing model.

Test 6: Malformed payload

Publish a record that cannot be deserialized.
Verify that the error is captured with topic, partition, and offset.
Confirm that the handling policy is applied.
Ensure one bad record does not create an unexplained permanent stall.

Test 7: New consumer group

Create a group with no committed offsets.
Start it with auto.offset.reset=earliest.
Verify that it reads retained history.
Repeat with a different reset policy and confirm the intended behavior.

Test 8: Scale beyond partition count

Start one consumer per partition.
Add more consumers to the same group.
Verify that the extra members remain idle.
Confirm that capacity planning is based on partitions and processing rate.

Common Mistakes

Treating send as immediate success

The producer API is asynchronous. Success must come from the completion result, not merely from calling the method.

Selecting a key without considering ordering

The key defines which records stay together. A random identifier can break entity ordering, while a low-cardinality key can overload one partition.

Using `acks=all` without reviewing replica requirements

The acknowledgment setting depends on the cluster's in-sync replica policy. The producer and topic must be designed together.

Assuming producer idempotence makes the whole workflow exactly once

It protects against duplicates caused by producer retries. It does not make an external database update and a Kafka offset commit atomic.

Sharing a consumer group between independent services

Members of one group split partitions. Independent subscribers require different group IDs.

Keeping automatic commits for critical work

Automatic progress can move ahead of unfinished processing. Manual commit after success is safer when loss is unacceptable.

Increasing consumer count without checking partitions

Consumers beyond the number of partitions do not add useful parallelism within the same group.

Increasing timeouts instead of fixing slow processing

Long timeouts can hide an overloaded or blocked consumer. Poll size, processing design, and external calls should be measured first.

Ignoring malformed records

A single incompatible record can repeatedly block a partition. Error handling must be part of the consumer design.

Monitoring only service health

A running consumer can still be far behind. Monitor partition-level lag, rebalances, processing failures, memory, poll duration, and producer delivery errors.

Delivery Checklist

Before releasing the producer:

The topic and record contract are documented.
The entity key preserves the required ordering boundary.
Serializers match the agreed format.
Send completion is observed asynchronously.
Acknowledgments and replica requirements form one durability policy.
Producer idempotence remains enabled.
Retry and in-flight settings preserve ordering.
Batch, timeout, and memory values have been tested.
Broker addresses returned to the client are reachable.
Compression has been evaluated with realistic payloads.

Before releasing the consumer:

Each logical subscriber has its own group ID.
The reset policy matches the service's recovery purpose.
Automatic commits are disabled for critical processing.
Offsets are committed only after successful work.
Database operations tolerate repeated delivery.
Poll size fits inside the allowed poll interval.
Heartbeat, session, and poll timeouts are aligned.
Rebalance behavior has been tested.
Deserialization failures have an explicit handling path.
Offset retention supports the maximum expected outage.
Consumer lag is monitored per partition.
Scaling expectations match the number of partitions.

Conclusion

A Kafka producer-consumer pipeline is a chain of independent durability and progress decisions. The producer must serialize correctly, select a stable key, batch efficiently, request meaningful acknowledgments, and retry without duplicating records. The consumer must join the correct group, process only its assigned partitions, handle rebalances, finish business work before committing offsets, and tolerate repeated delivery around external side effects.

The safest mental model is not "Kafka delivers each business action once." It is "Kafka preserves records and positions, while applications define when publishing and processing are complete."

When producer reliability, partitioning, consumer groups, offset commits, processing time, and database behavior are designed together, failures become recoverable and explainable instead of silent and destructive.