Preventing Out-of-Order Inventory Updates and Data Loss in Kafka Topic Design

A warehouse platform may begin with a simple requirement: publish every inventory change so fulfillment, replenishment, analytics, and customer-facing services can react quickly. The first proof of concept often works with one topic, one producer, and one consumer. Problems appear later, when traffic grows and several consumer instances process updates concurrently.

A stock increase for an item may be applied after a newer stock decrease. One partition may become overloaded while others remain nearly idle. A broker failure may make recent updates unavailable. A recovering consumer may discover that the events it needs have already expired. These failures are not caused by Kafka being unreliable. They usually come from topic design decisions that did not match the business rules.

This tutorial designs a Kafka data architecture for inventory updates. The goal is to preserve update order for each product, allow unrelated products to be processed in parallel, maintain availability during broker failures, and choose an appropriate policy for event retention or latest-state storage.

The System We Need to Design

Assume an inventory platform has these components:

InventoryService accepts stock adjustments from warehouses and publishes inventory events.
FulfillmentService consumes those events before confirming orders.
ReplenishmentService watches low-stock changes and schedules restocking.
InventoryViewService maintains a queryable projection for dashboards and APIs.
A Kafka cluster transports and stores the records.

The high-level flow looks like this:

Warehouse clients
       |
       v
InventoryService
       |
       | publishes stock changes
       v
Kafka topic: inventory-updates
       |
       +--------------------+
       |                    |
       v                    v
FulfillmentService   ReplenishmentService
       |
       v
InventoryViewService

The critical business rule is simple:

Changes for the same stock item must be processed in the order in which Kafka stores them.

At the same time, updates for different products do not need a global order. A change for product SKU-101 can be processed independently from a change for SKU-902.

That distinction determines the partitioning strategy.

Why One Topic Is Not One Ordered Queue

A Kafka topic is a logical destination for related records. For scalability, the topic is divided into partitions. Each partition is an append-only sequence of immutable records.

Kafka preserves ordering inside a partition. It does not provide one global order across all partitions in a topic.

Consider a topic with three partitions:

inventory-updates

Partition 0: [A1] [A2] [C1]
Partition 1: [B1] [B2]
Partition 2: [D1] [D2] [D3]

Records inside partition 0 have a stable sequence. Records across partition 0 and partition 1 do not share a meaningful combined order because the partitions can be consumed concurrently.

This is useful rather than limiting. It lets the system process independent groups of records in parallel. The design task is to decide which records belong together.

Select a Key That Represents the Ordering Boundary

When a record has a key, the producer's partitioner hashes that key to select a partition. With key-based partitioning, records with the same key are routed to the same partition for as long as the topic's partition count stays the same.

For inventory data, a product identifier alone may not be enough. The same product can exist in multiple warehouses, and each location can have its own stock level. A better key is the combination of warehouse and product:

warehouseId + ":" + productId

Example keys:

GOT-01:SKU-101
GOT-01:SKU-902
STH-03:SKU-101

This key defines the ordering boundary. Every stock change for GOT-01:SKU-101 reaches the same partition and can be processed sequentially.
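
The routing rule can be sketched in a few lines. This is an illustration rather than Kafka's implementation: the Java client's default partitioner hashes the serialized key with murmur2, whereas the sketch substitutes CRC32 from the Python standard library so that it runs without dependencies. The function names and the partition count of 6 are assumptions made for the example.

```python
import zlib

NUM_PARTITIONS = 6  # assumed partition count for this example


def inventory_key(warehouse_id: str, product_id: str) -> str:
    """Build the composite key: warehouseId + ":" + productId."""
    return f"{warehouse_id}:{product_id}"


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a record key to a partition with a deterministic hash.

    CRC32 is a stand-in for the hash a real client partitioner uses.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


key = inventory_key("GOT-01", "SKU-101")
partition = partition_for(key)
print(f"{key} -> partition {partition}")
```

Because the hash is deterministic, every call with the same key and the same partition count returns the same partition, which is what keeps one inventory item's updates in a single ordered log.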

An event payload might conceptually contain:

Key:
GOT-01:SKU-101

Value:
eventType = STOCK_ADJUSTED
quantityDelta = -3
recordedAt = 2026-06-20T09:15:00Z

Headers:
trace-id = 7af3c1
event-version = 1

The key is not a uniqueness constraint. Many records should share the same key because each record represents another state change for the same inventory item.

What Happens with a Poor Key

Suppose the team chooses only warehouseId as the key. If one warehouse produces most of the traffic, nearly all its records land in one partition. That partition becomes a hot spot while other partitions remain underused.

Suppose the team uses a random identifier as the key. Load distribution may improve, but updates for the same item can reach different partitions. Consumer instances may then apply them out of order.

A practical key must satisfy two goals:

Related records that require ordering must share the same key.
The key space should distribute traffic reasonably across partitions.
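
The effect of key choice on load can be estimated before any broker is involved. The sketch below uses synthetic traffic in which one warehouse produces 80% of the records, and compares a warehouse-only key with the composite key. The traffic shape, the partition count, and the CRC32 hash (a stand-in for a real partitioner) are all assumptions.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 6  # assumed partition count


def partition_for(key, num_partitions=NUM_PARTITIONS):
    # CRC32 stands in for a real client's key hash.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def max_partition_share(keys):
    """Fraction of all records that land on the busiest partition."""
    counts = Counter(partition_for(k) for k in keys)
    return max(counts.values()) / len(keys)


# Synthetic traffic: warehouse GOT-01 produces 80% of all records.
events = [("GOT-01", f"SKU-{i % 500}") for i in range(8000)]
events += [("STH-03", f"SKU-{i % 500}") for i in range(2000)]

by_warehouse = [warehouse for warehouse, _ in events]
by_composite = [f"{warehouse}:{product}" for warehouse, product in events]

print(f"warehouse-only key: {max_partition_share(by_warehouse):.0%} on busiest partition")
print(f"composite key:      {max_partition_share(by_composite):.0%} on busiest partition")
```

With the warehouse-only key, the busiest partition receives at least the dominant warehouse's full share. The composite key spreads the same traffic over a thousand distinct keys, so no single partition should dominate.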

Use Partitions to Control Consumer Parallelism

Partitions are also units of consumer parallelism. Within one consumer group, each partition is assigned to only one active consumer at a time.

Assume FulfillmentService runs three instances and the topic has six partitions:

Partition 0 -> Fulfillment instance A
Partition 1 -> Fulfillment instance A
Partition 2 -> Fulfillment instance B
Partition 3 -> Fulfillment instance B
Partition 4 -> Fulfillment instance C
Partition 5 -> Fulfillment instance C

Each partition is processed by one group member, but different partitions can be processed concurrently. Because all updates for one inventory key stay in one partition, they also stay with one active consumer at a time.

Adding consumer instances beyond the partition count does not increase parallel processing for that consumer group. If the topic has six partitions, at most six consumers in the same group can receive active assignments.
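
The limit can be seen in a small model of partition assignment. The function below splits partitions into contiguous ranges, one range per consumer. It is a simplified illustration, since real assignments depend on the assignor configured for the consumer group, and the function name is hypothetical.

```python
def assign_partitions(num_partitions, consumers):
    """Split partitions into contiguous ranges, one range per consumer.

    Simplified model: each partition goes to exactly one consumer, and
    consumers beyond the partition count receive nothing.
    """
    assignment = {}
    base, extra = divmod(num_partitions, len(consumers))
    start = 0
    for index, consumer in enumerate(consumers):
        size = base + (1 if index < extra else 0)
        assignment[consumer] = list(range(start, start + size))
        start += size
    return assignment


three = assign_partitions(6, ["A", "B", "C"])
eight = assign_partitions(6, [f"c{i}" for i in range(8)])
idle = [consumer for consumer, parts in eight.items() if not parts]

print(three)  # two partitions per instance
print(idle)   # consumers left without an assignment
```

With three instances, each receives two partitions, matching the layout above. With eight instances and six partitions, two instances stay idle until another member leaves the group.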

Estimating a Starting Partition Count

A practical estimate can be based on target throughput:

requiredPartitions = max(
    targetThroughput / producerThroughputPerPartition,
    targetThroughput / consumerThroughputPerPartition
)

This is only a starting point. Producer batching, message size, consumer processing time, broker capacity, and network conditions all influence the result. Performance tests are still necessary.
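
As a worked example, the estimate can be computed with rounding up, because a fractional partition is not possible. The throughput figures are hypothetical and only need to share a unit, such as MB/s.

```python
import math


def required_partitions(target, producer_per_partition, consumer_per_partition):
    """Starting estimate for the partition count (all values in the same unit)."""
    return max(
        math.ceil(target / producer_per_partition),
        math.ceil(target / consumer_per_partition),
    )


# Hypothetical figures: 120 MB/s target, 25 MB/s per partition when
# producing, 15 MB/s per partition when consuming.
estimate = required_partitions(120, 25, 15)  # max(ceil(4.8), ceil(8.0)) = max(5, 8)
print(estimate)
```

In this example the consumer side is the bottleneck, so speeding up consumer processing would lower the required count more than tuning the producer would.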

Avoid choosing an unnecessarily high number. Every partition adds metadata, files, replication work, and recovery overhead. More partitions can increase throughput, but they also increase operational cost.

The partition count can change later in only one direction: partitions can be added to an existing topic, but reducing the count requires recreating the topic. Adding partitions also changes the key-to-partition mapping for new records, while records already written stay in their original partitions. When strict per-key ordering matters across the change, increasing the count requires a controlled migration plan.
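
The remapping effect is easy to demonstrate with a hash-modulo partitioner. The sketch below counts how many keys would be routed to a different partition after growing a topic from 6 to 12 partitions. CRC32 is used as a stand-in for a real client's hash, and the key set is synthetic.

```python
import zlib


def partition_for(key, num_partitions):
    # CRC32 stands in for a real client's key hash.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


keys = [f"GOT-01:SKU-{i}" for i in range(1000)]

# Keys whose new records would go to a different partition after the change.
moved = [k for k in keys if partition_for(k, 6) != partition_for(k, 12)]

print(f"{len(moved)} of {len(keys)} keys change partition when going from 6 to 12")
```

For every moved key, older records remain in the original partition while newer records arrive in another one. Two consumers can then hold updates for the same inventory item at the same time, which is the situation a migration plan has to prevent.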

Replicate Partitions to Survive Broker Failures

Partitions provide parallelism. Replicas provide redundancy.

Each partition has one leader replica and one or more follower replicas. Producers write to the leader, and followers copy the leader's records.

A three-broker layout might look like this:

Partition 0
  Broker 1: leader
  Broker 2: follower
  Broker 3: follower

Partition 1
  Broker 2: leader
  Broker 3: follower
  Broker 1: follower

Partition 2
  Broker 3: leader
  Broker 1: follower
  Broker 2: follower

If a broker fails, an eligible follower can take over as leader. This keeps the partition available as long as the cluster still has a suitable replica.

The replication factor defines how many copies of each partition exist. A replication factor of 1 means there is no redundant copy. If the broker holding that partition becomes unavailable, the data is unavailable and may be lost.

For important inventory data, the design should explicitly define both:

replication.factor
min.insync.replicas

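The second value specifies how many in-sync replicas, counting the leader, must be available for a write to succeed when the producer requests acknowledgment from all in-sync replicas (acks=all).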

A common durable baseline for a three-broker cluster is:

replication.factor = 3
min.insync.replicas = 2

This configuration means each partition has three replicas, and a write from a producer using the strongest acknowledgment mode (acks=all) is accepted only while at least two replicas are in sync. The cluster can therefore lose one broker without rejecting writes.

Replication factor and minimum in-sync replicas solve different problems:

Setting                    Architectural purpose
Replication factor         Controls how many copies of each partition exist
Minimum in-sync replicas   Controls how many sufficiently current replicas must be available for a successful write

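A high replication factor does not automatically guarantee strong durability. The minimum in-sync replica requirement is enforced only for producers that use acks=all, so a producer using acks=1 can have a write acknowledged by the leader alone, whatever the replication factor. Producer acknowledgment behavior must agree with the topic's replica requirements.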
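
How the settings interact can be modeled in a few lines. The following is a simplified sketch, not broker code: the function name is hypothetical, and the model ignores timeouts, retries, and leader election. It captures only the rule that the in-sync minimum applies to acks=all writes.

```python
def write_accepted(acks: str, in_sync_replicas: int, min_insync_replicas: int) -> bool:
    """Simplified model of whether a produce request is accepted.

    acks="all": rejected when fewer than min.insync.replicas replicas
                are in sync.
    acks="1" or acks="0": the minimum is not enforced, so a live leader
                is enough.
    """
    if in_sync_replicas < 1:
        return False  # no leader is available for the partition
    if acks == "all":
        return in_sync_replicas >= min_insync_replicas
    return True


# replication.factor = 3, min.insync.replicas = 2
print(write_accepted("all", 3, 2))  # all replicas healthy
print(write_accepted("all", 2, 2))  # one broker down
print(write_accepted("all", 1, 2))  # two brokers down
print(write_accepted("1", 1, 2))    # leader-only acknowledgment
```

In this model, the acks=all producer keeps writing through one broker failure and is refused after a second. The acks=1 producer is still accepted with a single surviving replica, which is exactly the write that would be lost if that last broker failed.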

Understand What Kafka Stores

Kafka records are immutable. A producer cannot modify a record already written at a particular position. Corrections are represented by later records.

Each record can be uniquely located by:

topic + partition + offset

The offset is a monotonically increasing position within one partition. Different partitions have independent offset sequences.

For example:

inventory-updates, partition 2, offset 18452

Offsets help consumers track progress. They are not business identifiers and should not replace an event identifier when the domain requires deduplication or audit tracking.
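
A consumer that needs deduplication can track a business identifier instead of the offset, because a retried send can store the same event at a second offset. The sketch below assumes the producer attaches a hypothetical event_id field to each event. The class names are illustrative, and the in-memory set would need a bounded or persistent replacement in a real service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StockEvent:
    event_id: str        # hypothetical business identifier set by the producer
    key: str             # warehouseId:productId
    quantity_delta: int


class StockProjection:
    """Applies each event at most once, even if it is delivered twice."""

    def __init__(self):
        self.quantities = {}
        self._seen = set()

    def apply(self, event: StockEvent) -> bool:
        if event.event_id in self._seen:
            return False  # duplicate delivery: ignore it
        self._seen.add(event.event_id)
        current = self.quantities.get(event.key, 0)
        self.quantities[event.key] = current + event.quantity_delta
        return True


projection = StockProjection()
event = StockEvent("evt-0001", "GOT-01:SKU-101", -3)
first = projection.apply(event)
second = projection.apply(event)  # same event seen again at another offset
```

The second call is ignored, so the quantity reflects the adjustment once.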

Kafka stores partition data in segment files on broker disks. One segment is active for new appends, while older segments are closed. Index files allow Kafka to locate records by offset or timestamp without scanning the entire partition log.

This storage model explains several Kafka characteristics:

Sequential appends are efficient.
Old data is removed at the segment level according to retention rules.
Individual historical records are not updated in place.
Reading by arbitrary business fields is not Kafka's primary access pattern.

Choose Between Event Retention and Latest-State Retention

Inventory data can be represented in two useful ways:

A complete stream of stock changes.
The latest known stock state for every inventory key.

These are different requirements and may justify different topics.

Option 1: Retain the Event Stream

An event topic records each change:

+10 units received
-2 units reserved
-1 unit shipped
+1 unit reservation released

This history is useful for replay, auditing, rebuilding projections, and diagnosing incorrect state.

Time-based retention should be longer than the maximum expected consumer outage, with additional safety margin. Kafka can remove data whether or not every consumer has processed it, so retention must be chosen from recovery requirements, not convenience.

Size-based retention, which Kafka applies per partition, limits storage more predictably, but traffic spikes can shorten the effective history. For critical inventory changes, time-based retention is generally easier to align with recovery expectations.
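
Deriving the retention value from recovery requirements can be made explicit. The sketch below converts an assumed maximum outage and safety margin into a value for the retention.ms topic setting. The durations are hypothetical and the function name is illustrative.

```python
from datetime import timedelta


def retention_ms(max_outage: timedelta, safety_margin: timedelta) -> int:
    """Return a retention.ms value that covers the outage plus a margin."""
    total = max_outage + safety_margin
    return total // timedelta(milliseconds=1)


# Hypothetical recovery requirement: a 3-day outage plus a 4-day margin.
value = retention_ms(timedelta(days=3), timedelta(days=4))
print(f"retention.ms = {value}")
```

Keeping the calculation next to the topic definition records why the number was chosen, which helps when the recovery objective changes.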

Option 2: Use a Compacted Topic for Current State

A compacted topic retains at least the latest value for each key. Earlier values may remain temporarily until background compaction rewrites older closed segments.

Example:

Key: GOT-01:SKU-101
Value: availableQuantity = 42

Later:

Key: GOT-01:SKU-101
Value: availableQuantity = 39

Kafka guarantees that the latest value for a key is retained. Once compaction has processed the older segments, the earlier value (42) is removed and only the latest value (39) remains for that key.

Compaction has important limits:

Access is still sequential, not random key lookup.
Cleanup does not happen immediately.
Keys cannot be null.
A deletion is represented by a tombstone record, which has a key and a null value. Tombstones are themselves removed after a configurable period (delete.retention.ms).
Consumers rebuilding state must read the topic and keep the newest observed value per key.
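
The rebuild rule in the last point can be sketched as a replay over (key, value) pairs in log order, with None standing for a tombstone's null value. This is a minimal in-memory illustration. A real consumer would read the records from the compacted topic, and the function name is hypothetical.

```python
def rebuild_state(records):
    """Replay (key, value) records in log order and keep the newest value.

    A value of None represents a tombstone and removes the key.
    """
    state = {}
    for key, value in records:
        if value is None:
            state.pop(key, None)  # tombstone: the entry was deleted
        else:
            state[key] = value    # a newer value replaces the older one
    return state


log = [
    ("GOT-01:SKU-101", {"availableQuantity": 42}),
    ("STH-03:SKU-101", {"availableQuantity": 7}),
    ("GOT-01:SKU-101", {"availableQuantity": 39}),
    ("STH-03:SKU-101", None),
]
state = rebuild_state(log)
print(state)
```

The result is the same whether or not compaction has already removed the superseded records, because the replay itself keeps only the newest value per key.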

For this architecture, a useful separation is:

inventory-events
  cleanup policy: delete
  purpose: replayable business history

inventory-current-state
  cleanup policy: compact
  purpose: rebuildable latest inventory projection

Trying to make one topic satisfy every historical and current-state requirement can create unclear retention behavior.

Document the Topic Contract

Topic names and schemas are only part of the architecture. Teams also need to document partition count, replication, keys, cleanup policy, and business purpose.

An adapted AsyncAPI-style definition could look like this:

asyncapi: 3.0.0
info:
  title: Inventory Event Platform
  version: 1.0.0

channels:
  inventoryUpdates:
    address: inventory-updates
    description: Ordered stock changes grouped by warehouse and product.
    messages:
      stockAdjusted:
        bindings:
          kafka:
            key:
              type: string
              description: Warehouse and product composite identifier
    bindings:
      kafka:
        partitions: 12
        replicas: 3
        topicConfiguration:
          min.insync.replicas: 2
          cleanup.policy:
            - delete

The partition count in this example is only a design placeholder. A real project should justify it through throughput estimates and tests.

Keeping this definition in version control makes architectural changes reviewable. It also reduces the chance that producer, consumer, platform, and operations teams work from different assumptions.

Test the Design Before Production

Topic design should be verified with behavior-focused tests, not only configuration review.

Ordering Test

Publish several updates for the same inventory key.
Run multiple consumer instances in one group.
Confirm that all updates for that key are processed in Kafka order.
Repeat with many unrelated keys to verify parallelism.
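
The ordering check can be automated if each consumer instance logs what it processed. The sketch below assumes the test harness collects (key, partition, offset) tuples in processing order. The function name and the log format are assumptions. A rewind caused by redelivery after a rebalance also counts as a violation here, so the harness should keep group membership stable or handle that case separately.

```python
def out_of_order_keys(processed):
    """Return the keys whose records were handled against offset order.

    `processed` holds (key, partition, offset) tuples in the order in
    which the consumers handled them.
    """
    last_offset = {}
    violations = set()
    for key, partition, offset in processed:
        previous = last_offset.get((key, partition))
        if previous is not None and offset < previous:
            violations.add(key)
        last_offset[(key, partition)] = offset
    return violations


in_order = [("GOT-01:SKU-101", 0, 10), ("STH-03:SKU-101", 1, 4), ("GOT-01:SKU-101", 0, 11)]
reordered = [("GOT-01:SKU-101", 0, 11), ("GOT-01:SKU-101", 0, 10)]

print(out_of_order_keys(in_order))
print(out_of_order_keys(reordered))
```

An empty result for a run with many keys and several consumer instances supports the claim that per-key order is preserved. A non-empty result names the keys to investigate.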

Distribution Test

Publish representative traffic using the proposed key.
Compare record counts and processing lag across partitions.
Check for hot partitions.
Revisit the key if a small number of values dominate the workload.

Broker Failure Test

Publish records continuously.
Stop the broker leading several partitions.
Confirm that new leaders are elected.
Verify producer behavior while replica availability changes.
Confirm that acknowledged records remain readable.

Consumer Recovery Test

Stop a consumer long enough to build a backlog.
Restart it before the retention window expires.
Confirm that it resumes from its committed offsets.
Test what happens when the outage exceeds the planned recovery period.

Compaction Test

Publish multiple values for the same key.
Publish tombstones for deleted inventory entries.
Rebuild a fresh in-memory state by consuming from the beginning.
Verify that the final state is correct even when older records are still present.

Common Design Mistakes

Expecting Global Ordering

Kafka orders records inside a partition, not across an entire multi-partition topic. A workflow that requires one global sequence may conflict with the desired level of parallelism.

Selecting a Key Only for Uniqueness

A unique event identifier spreads records but does not group related state changes. The key should normally represent the entity or consistency boundary whose updates must remain ordered.

Increasing Partitions Without an Ordering Plan

Adding partitions changes how keys may map to partitions for future records. A topic that depends on long-lived per-key ordering needs a deliberate migration strategy.

Treating Replication as a Complete Durability Policy

Replication factor, minimum in-sync replicas, and producer acknowledgments must work together. Configuring only one of them leaves gaps in the guarantee.

Using Compaction as a Query Database

A compacted topic preserves latest values, but consumers still read sequentially. Build a local projection or use a database when the application needs efficient business queries.

Choosing Retention Without Consumer Recovery Requirements

Retention that is shorter than a realistic outage can make recovery impossible. Retention should be connected to service-level objectives and recovery procedures.

Architecture Checklist

Before approving a Kafka topic for inventory events, confirm the following:

The topic has one clearly defined business purpose.
The partition key matches the required ordering boundary.
The key distribution has been tested with realistic data.
The partition count supports expected producer and consumer throughput.
The design acknowledges that ordering exists only within a partition.
Replication factor and minimum in-sync replicas are explicitly configured.
Producer acknowledgment requirements match the durability goal.
Retention exceeds the expected consumer recovery window.
Compaction is used only when latest-per-key state is required.
Tombstone behavior is tested for deleted state.
Topic metadata and contracts are stored in version control.
Broker failure, consumer recovery, and replay have been tested.

Conclusion

A reliable Kafka architecture begins with business invariants, not broker settings. For an inventory pipeline, the most important invariant is usually ordered processing for each warehouse and product combination. That rule leads directly to a stable composite key, which determines partition placement and consumer behavior.

Partitions then provide parallelism, replicas provide fault tolerance, retention supports recovery, and compaction can preserve current state. When these mechanisms are designed together, Kafka becomes more than a transport channel. It becomes a durable event backbone whose behavior matches the system's consistency, scalability, and recovery requirements.