A checkout request reaches your application and appears simple: validate the order, load the delivery address, reserve stock, charge the customer, arrange shipment, and update analytics. The first implementation often connects these capabilities with synchronous HTTP calls. That design is easy to understand while the workflow is small, but every new dependency adds another place where latency, downtime, or a deployment mismatch can block the entire request.
Moving the workflow to Kafka can remove many direct dependencies, but it does not make the hard problems disappear. The system becomes asynchronous, which means different services may temporarily hold different views of the same business entity. A producer may retry after an acknowledgment is lost. A consumer may be unavailable for a while and catch up later. An event contract may evolve while older consumers are still running.
The goal is therefore not simply to replace REST with Kafka. The goal is to design a durable event flow in which services can operate independently, failures are recoverable, message contracts are governed, and every consistency tradeoff is visible before code reaches production.
The Problem: A Workflow That Is Only as Reliable as Its Weakest Call
Consider an order workflow implemented as a chain of synchronous requests:
    Client
      |
      v
    CheckoutService
      |----> CustomerProfileService
      |----> InventoryService
      |----> PaymentService
      |----> DeliveryService
      |----> AnalyticsService
      |
      v
    HTTP response
This design has useful properties. The caller receives an immediate response, failures can be returned directly, and the control flow is easy to trace in a debugger. It is a good fit when one component genuinely needs an immediate answer from another component before it can continue.
The problems begin when the workflow grows:
- A slow downstream service increases the checkout response time.
- A temporary failure can cause the whole operation to fail even when most steps could run later.
- Every new subscriber requires another direct integration.
- Deployment order becomes more sensitive because callers depend on live downstream APIs.
- A retry can repeat work in several services unless every operation is designed carefully.
- A failure in one dependency can spread through the call chain and create a wider outage.
The architecture is tightly coupled in time. CheckoutService can proceed only while all required dependencies are reachable and responsive.
An event-driven design changes that relationship. Instead of asking every interested service to act during the checkout request, the order-owning service records the business change and publishes an event describing what happened. Other services consume that event independently.
Decide What Should Stay Synchronous
Kafka is not a universal replacement for HTTP. A practical migration starts by classifying each interaction.
| Interaction type | Typical question | Better default |
|---|---|---|
| Query | What is the current delivery price? | Synchronous request-response |
| Command requiring an immediate result | Can this payment be authorized now? | Synchronous request-response or a carefully designed asynchronous workflow |
| Business notification | An order was accepted | Event |
| Change distribution | A customer changed a delivery address | Event |
| Data replication | Copy selected records into another platform | Kafka Connect or another replication mechanism |
| Continuous transformation | Route, filter, join, or aggregate events | Stream-processing application |
A useful rule is to publish facts, not remote procedure calls disguised as messages. An event should state that something meaningful has already happened. For example, OrderAccepted describes a completed business decision. A message named CallInventoryService exposes the producer's workflow and couples the event channel to one particular consumer.
Keep direct calls when the caller must receive an immediate answer. Use events when multiple independent services need to react, when temporary consumer downtime must not block the producer, or when historical events need to be replayed.
Define the Target Architecture Before Choosing Client Settings
A minimal Kafka-based order flow can look like this:
                          Kafka cluster
                     +---------------------+
                     |  Broker  |  Broker  |
                     |  Broker  | Event log|
                     +---------------------+
                       ^        ^        ^
                       |        |        |
                    publish    pull     pull
                       |        |        |
    OrderService ------+        |        +---------- AnalyticsProjection
                                |
                                +----------------- FulfillmentService
                                |
                                +----------------- NotificationService
KRaft controllers manage cluster metadata and coordinate broker membership.
The main roles are straightforward:
- Producer: An application that sends messages to Kafka. In this workflow, OrderService publishes an event after accepting an order.
- Broker: A Kafka server that receives, persists, and serves messages. Several brokers form a cluster so data and workload can be distributed.
- Consumer: An application that subscribes to a topic and pulls messages from Kafka. Each consumer decides how to react.
- KRaft controller: A control-plane component that stores cluster metadata, tracks broker health, and coordinates changes such as broker failure handling.
The producer does not need to know how many consumers exist. A new reporting service can subscribe later without changing OrderService. A temporarily unavailable consumer can resume after recovery because the events remain stored according to Kafka's retention configuration.
This decoupling is valuable, but it changes the system's consistency model. OrderService may show an accepted order before FulfillmentService has created a shipment. Analytics may update a few moments later. That is eventual consistency: the system converges toward a shared state, but not all components change at the same instant.
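The catch-up behavior described above can be sketched in plain Python. This is a minimal in-memory model, not a Kafka client: `EventLog` and `Consumer` are hypothetical names standing in for one retained topic and for consumers that each track their own read position.

```python
from dataclasses import dataclass, field


@dataclass
class EventLog:
    """Hypothetical in-memory stand-in for one retained topic."""
    records: list = field(default_factory=list)

    def append(self, event: dict) -> int:
        # The producer only appends; it never learns who will read the record.
        self.records.append(event)
        return len(self.records) - 1  # offset of the stored record


@dataclass
class Consumer:
    """Each consumer tracks its own position in the shared log."""
    name: str
    offset: int = 0
    seen: list = field(default_factory=list)

    def poll(self, log: EventLog) -> int:
        # Pull everything stored since the last position, then advance.
        batch = log.records[self.offset:]
        self.seen.extend(event["orderId"] for event in batch)
        self.offset += len(batch)
        return len(batch)


log = EventLog()
analytics = Consumer("AnalyticsProjection")
fulfillment = Consumer("FulfillmentService")

log.append({"type": "OrderAccepted", "orderId": "o-1"})
analytics.poll(log)
# FulfillmentService is "offline" here: it simply does not poll.
log.append({"type": "OrderAccepted", "orderId": "o-2"})
analytics.poll(log)

# The two views differ until the slower consumer catches up.
lag_before = len(log.records) - fulfillment.offset
fulfillment.poll(log)  # the consumer recovers and reads the retained events
lag_after = len(log.records) - fulfillment.offset
```

The producer code never references a consumer, and the gap between `lag_before` and `lag_after` is the window in which the services hold different views of the same orders.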
Step 1: Choose One Event That Represents a Real Business Fact
Do not begin by moving the entire workflow. Select one event with clear ownership and several useful consumers. In this example, OrderService owns the order lifecycle and publishes OrderAccepted only after it has made the decision that the order is valid.
A first contract can be described without binding it to a programming language:
    Event name: OrderAccepted
    Owner: OrderService
    Business meaning: The seller has accepted responsibility for processing the order.
    Identity: orderId
    Occurred at: eventTime
    Payload:
      orderId
      customerId
      deliveryAddress
      totalAmount
      currency
      contractVersion
    Technical metadata:
      traceId
The contract should answer practical questions:
- Who owns the event definition?
- What business fact does it represent?
- Which fields are required for consumers to act independently?
- Which identifier can consumers use to recognize repeated processing?
- How will consumers understand a newer version?
- Which fields belong to business data, and which belong to technical metadata?
Kafka brokers treat message content as bytes. They store and forward the message, but they do not understand whether totalAmount exists or whether its type changed. Contract governance therefore belongs outside the broker.
A Schema Registry can provide a central place to store schemas, assign versions, and check compatibility. Producers and consumers can then agree on message structure without directly depending on each other's application code.
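Because the broker sees only bytes, the consumer has to do the checking. The sketch below shows one possible client-side decoder for the OrderAccepted contract. It assumes JSON encoding and Python field types that the contract does not specify, and `decode_order_accepted` is a hypothetical helper, not part of any Schema Registry product.

```python
import json
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class OrderAccepted:
    # Field names follow the contract; the Python types are assumptions.
    orderId: str
    customerId: str
    deliveryAddress: str
    totalAmount: str      # decimal string, to avoid float rounding of money
    currency: str
    contractVersion: int
    eventTime: str        # assumed to be an ISO 8601 timestamp
    traceId: str          # technical metadata, not business data


def decode_order_accepted(raw: bytes) -> OrderAccepted:
    """Validate on the client, because the broker does not inspect payloads."""
    data = json.loads(raw)
    known = {f.name for f in fields(OrderAccepted)}
    missing = known - set(data)
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    # Tolerant reader: fields added by newer contract versions are ignored.
    return OrderAccepted(**{name: data[name] for name in known})


v1 = {
    "orderId": "o-42", "customerId": "c-7",
    "deliveryAddress": "1 Example Street",
    "totalAmount": "19.99", "currency": "EUR", "contractVersion": 1,
    "eventTime": "2024-01-01T12:00:00Z", "traceId": "t-1",
}
# A newer producer adds an optional field; an older consumer still decodes it.
v2 = dict(v1, contractVersion=2, giftMessage="Enjoy")
order = decode_order_accepted(json.dumps(v2).encode("utf-8"))
```

Ignoring unknown fields lets old consumers survive additive changes, while a missing required field fails loudly. Removing or retyping a field is still a breaking change that needs a governed compatibility check.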
Step 2: Remove the Temporal Dependency with Local State
Suppose FulfillmentService needs a delivery address. The synchronous design asks CustomerProfileService for the address during every order. The event-driven alternative is to let CustomerProfileService publish an address-change event whenever its state changes. FulfillmentService consumes those changes and maintains the fields it needs locally.
    CustomerProfileService
      |
      | publishes DeliveryAddressChanged
      v
    Kafka
      |
      | consumer pulls event
      v
    FulfillmentService
      |
      v
    Local delivery-address projection
The local copy lets FulfillmentService continue even when CustomerProfileService is temporarily offline. It also removes a network call from the fulfillment path.
This improvement introduces a business decision that must be explicit: how old may the local address be when shipment begins? An architect must define what happens when an order is accepted while an address update is still in transit. Possible policies include delaying shipment until related events are caught up, storing the address snapshot in OrderAccepted, or requiring the customer to confirm the delivery address during checkout.
Kafka supplies durable transport and replay. It cannot decide which consistency rule is correct for the business.
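One way to make the staleness rule explicit is to encode it next to the projection. The sketch below is an illustration under assumptions: `DeliveryAddressProjection`, `mark_caught_up`, and the `max_staleness` policy are hypothetical, and "delay shipment when the projection is too far behind" is only one of the policies listed above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional


@dataclass
class AddressEntry:
    address: str
    changed_at: datetime  # eventTime of the change that was applied


class DeliveryAddressProjection:
    """Hypothetical local copy built from DeliveryAddressChanged events."""

    def __init__(self) -> None:
        self._by_customer: Dict[str, AddressEntry] = {}
        self._caught_up_at: Optional[datetime] = None

    def apply(self, customer_id: str, address: str, event_time: datetime) -> None:
        current = self._by_customer.get(customer_id)
        # Ignore a change that is older than the one already stored.
        if current is None or event_time >= current.changed_at:
            self._by_customer[customer_id] = AddressEntry(address, event_time)

    def mark_caught_up(self, now: datetime) -> None:
        # Called by the consumer loop when it has read all available events.
        self._caught_up_at = now

    def address_for_shipment(self, customer_id: str, now: datetime,
                             max_staleness: timedelta) -> Optional[str]:
        """Return an address only if the projection is fresh enough."""
        entry = self._by_customer.get(customer_id)
        if entry is None or self._caught_up_at is None:
            return None
        if now - self._caught_up_at > max_staleness:
            return None  # too far behind: the caller should delay shipment
        return entry.address


t0 = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
projection = DeliveryAddressProjection()
projection.apply("c-1", "1 Old Street", t0)
projection.apply("c-1", "2 New Street", t0 + timedelta(minutes=5))
projection.apply("c-1", "0 Late Street", t0 - timedelta(minutes=5))  # ignored
projection.mark_caught_up(t0 + timedelta(minutes=6))

limit = timedelta(minutes=10)
fresh = projection.address_for_shipment("c-1", t0 + timedelta(minutes=7), limit)
stale = projection.address_for_shipment("c-1", t0 + timedelta(hours=1), limit)
```

The value of `max_staleness` is a business decision. The code only makes that decision visible and testable.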
Step 3: Design for Retry and Repeated Processing
Reliable delivery requires cooperation between producers, brokers, and consumers. A producer sends data and can wait for an acknowledgment that the cluster accepted it. When the acknowledgment does not arrive, the producer may retry.
A retry creates an important ambiguity. The first attempt may have failed before Kafka stored the message, or Kafka may have stored it while the acknowledgment was lost. From the producer's perspective, both situations can look the same.
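The ambiguity can be reproduced with a small simulation. `FlakyBroker` and `send_with_retry` are hypothetical stand-ins written for this illustration, not Kafka client APIs. The stub always stores the record but drops the acknowledgment on chosen attempts.

```python
class FlakyBroker:
    """Hypothetical broker stub that stores every record but loses some acks."""

    def __init__(self, lose_ack_on_attempts):
        self.log = []
        self._lose = set(lose_ack_on_attempts)
        self._attempt = 0

    def send(self, record) -> bool:
        self._attempt += 1
        self.log.append(record)  # the write itself succeeds
        # False means the acknowledgment never reached the producer.
        return self._attempt not in self._lose


def send_with_retry(broker, record, max_attempts=3) -> int:
    """Retry until an acknowledgment is seen; return the attempts used."""
    for attempt in range(1, max_attempts + 1):
        if broker.send(record):
            return attempt
    raise TimeoutError("no acknowledgment received")


broker = FlakyBroker(lose_ack_on_attempts={1})
attempts = send_with_retry(broker, {"type": "OrderAccepted", "orderId": "o-42"})
stored_copies = len(broker.log)  # the same event is now stored twice
```

The producer behaved correctly and the log still holds two copies. Kafka's idempotent producer setting (`enable.idempotence`) is designed to suppress duplicates caused by the client's own internal retries, but it does not cover an application that resends after a restart or timeout.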
Consumers must therefore avoid assuming that processing will happen only once. A generic consumer workflow can be written as follows:
    on OrderAccepted(event):
        if processingLog already contains event.orderId:
            acknowledge event
            stop
        create fulfillment work for event.orderId
        record event.orderId in processingLog
        acknowledge event
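This is conceptual pseudocode rather than a complete transaction design. In particular, creating the work and recording the identifier must succeed or fail together; otherwise a crash between the two steps can still lead to repeated work. The important architectural point is that a stable business identifier gives the consumer a way to recognize work it has already completed.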
The same care is needed for side effects. Sending two emails may be annoying. Reserving inventory twice or charging twice is far more serious. Each consumer should document the result of repeated delivery and decide how it prevents duplicate effects.
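A runnable variant of the consumer workflow can use a database constraint as the processing log. The sketch below uses SQLite from the Python standard library. The table names and the `on_order_accepted` function are assumptions made for this example, and acknowledging the event to Kafka is left to the caller.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE processing_log (order_id TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE fulfillment_work (order_id TEXT NOT NULL)")
    return conn


def on_order_accepted(conn: sqlite3.Connection, event: dict) -> bool:
    """Return True if work was created, False if the event was a repeat."""
    order_id = event["orderId"]
    try:
        with conn:  # one transaction: both inserts commit or neither does
            # The primary key rejects an orderId that was already processed.
            conn.execute("INSERT INTO processing_log VALUES (?)", (order_id,))
            conn.execute("INSERT INTO fulfillment_work VALUES (?)", (order_id,))
        return True
    except sqlite3.IntegrityError:
        return False  # repeated delivery: safe to acknowledge and move on


conn = open_store()
event = {"type": "OrderAccepted", "orderId": "o-42"}
results = [on_order_accepted(conn, event) for _ in range(3)]
work_rows = conn.execute("SELECT COUNT(*) FROM fulfillment_work").fetchone()[0]
```

Recording the identifier and creating the work in one transaction closes the gap where a crash could leave one without the other. A side effect outside the database, such as a payment call, needs its own protection, for example an idempotency key accepted by the external system.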
Step 4: Treat the Event Log as Durable History, Not a Mutable Queue
Kafka appends messages to a commit log. Once written, an event is immutable. Consumers move through the stored sequence and can replay earlier messages while those messages are still retained.
That behavior changes how corrections are handled. An incorrect event is not edited in place. The producing service publishes a new event that expresses the correction or a newer state.
Replay is useful for several tasks:
- Rebuilding a local projection after its database is lost.
- Starting a new consumer from historical events.
- Reprocessing data after consumer logic changes.
- Investigating how a service arrived at its current state.
Replay also exposes hidden assumptions. A consumer that sends an external notification every time it reads an event may send old notifications again during a rebuild. Separate state reconstruction from non-repeatable side effects, or add an explicit replay policy.
Kafka retains events up to a configured time or size limit, and retention can be extended for architectures that keep the event history much longer. That does not turn Kafka into a general-purpose query database. Kafka is optimized around sequential event access. Services commonly build local databases or in-memory projections for business queries.
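Separating state reconstruction from side effects can be as simple as an explicit replay flag. The sketch below is illustrative: the `process` function, the `status` field, and the notification callback are assumptions, not part of the order contract described earlier.

```python
from typing import Callable, Iterable


def process(events: Iterable[dict], notify: Callable[[str], None],
            replay: bool) -> dict:
    """Fold events into a projection; notify only during live processing."""
    state: dict = {}
    for event in events:
        # State reconstruction is deterministic and safe to repeat.
        state[event["orderId"]] = event["status"]
        # Non-repeatable side effect: skipped while rebuilding.
        if not replay and event["status"] == "ACCEPTED":
            notify(event["orderId"])
    return state


retained = [
    {"orderId": "o-1", "status": "ACCEPTED"},
    {"orderId": "o-2", "status": "ACCEPTED"},
    # A correction is a new event, not an edit of the first record.
    {"orderId": "o-1", "status": "CANCELLED"},
]

sent = []
live_state = process(retained, sent.append, replay=False)

# The projection database is lost; rebuild it from the retained events.
rebuilt_state = process(retained, sent.append, replay=True)
```

After the rebuild, `rebuilt_state` equals `live_state` and `sent` still holds only the two original notifications. A production design would usually derive the replay decision from stored offsets or a processing log rather than from a flag passed by hand.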
Step 5: Put Transformation Logic in the Right Place
As the number of consumers grows, they may need different versions of the same information. Imagine that the customer profile domain publishes a general AddressChanged event. Fulfillment needs only shipping addresses, while Billing needs only invoice addresses.
There are three common placements for the filtering logic:
- Inside the producer: The profile service publishes several specialized events. This can overload the producer with knowledge about downstream needs.
- Inside every consumer: Each service receives the general event and discards irrelevant records. This duplicates logic and processing.
- Inside a stream-processing layer: A separate application consumes the general event, applies routing or transformation rules, and publishes focused downstream events.
                AddressChanged topic
                          |
                          v
               AddressRoutingProcessor
               |----------------------|
               v                      v
    ShippingAddressChanged  BillingAddressChanged
               |                      |
               v                      v
      FulfillmentService       BillingService
Kafka Streams and Apache Flink are examples of frameworks that can support filtering, aggregation, joining, and stateful processing. Kafka Connect can handle simpler data movement and limited stateless transformations, but it is not intended to replace a full processing layer for complex business rules.
The architectural objective is to prevent transformation logic from being scattered across every producer and consumer.
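The routing rule itself is small enough to show in plain Python. This sketch only illustrates the shape of the logic that a stream-processing application would host. It does not use Kafka Streams or Flink, and the `addressType` field, its values, and the payload field names are assumptions.

```python
# Hypothetical mapping from an address type to the focused downstream event.
ROUTES = {
    "SHIPPING": "ShippingAddressChanged",
    "BILLING": "BillingAddressChanged",
}


def route_address_changed(event: dict) -> list:
    """Return (target event name, payload) pairs for one AddressChanged event."""
    target = ROUTES.get(event.get("addressType"))
    if target is None:
        # Unknown types are dropped here; a real design might park them
        # on a separate topic for inspection instead.
        return []
    focused = {key: event[key] for key in ("customerId", "address", "eventTime")}
    return [(target, focused)]


incoming = [
    {"customerId": "c-1", "addressType": "SHIPPING",
     "address": "1 Dock Road", "eventTime": "2024-01-01T10:00:00Z"},
    {"customerId": "c-1", "addressType": "BILLING",
     "address": "9 Ledger Lane", "eventTime": "2024-01-01T10:05:00Z"},
    {"customerId": "c-2", "addressType": "HOLIDAY",
     "address": "3 Beach Walk", "eventTime": "2024-01-01T10:10:00Z"},
]

outgoing = [pair for event in incoming for pair in route_address_changed(event)]
targets = [name for name, _ in outgoing]
```

Keeping the rule in one processor means FulfillmentService and BillingService each receive only the events they need, and a new address type requires a change in one place.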
Step 6: Use Kafka Connect When the Requirement Is Data Movement
Teams often write a custom producer to read database changes and a custom consumer to write them into another system. When the requirement is primarily replication, this creates application code that must be developed, deployed, monitored, and maintained even though it contains little business logic.
Kafka Connect offers a configuration-driven alternative. Source connectors move data from external systems into Kafka, while sink connectors move data from Kafka into target systems. Connector plugins provide system-specific integration behavior.
Use a custom application when the flow contains business decisions, domain validation, or complex state transitions. Consider Kafka Connect when the task is mainly to transfer records between Kafka and a database, warehouse, or storage platform.
This distinction keeps business logic in business services and routine integration work in an integration framework.
Step 7: Plan Operations as Part of the Architecture
A Kafka design is incomplete when it contains only producers, topics, and consumers. The cluster must be secured, monitored, scaled, upgraded, and recovered after failures. Client applications also need monitoring because a healthy broker cluster does not guarantee that consumers are processing successfully.
Before implementation, answer these questions:
- Who owns the Kafka platform?
- How will broker and consumer failures be detected?
- How will teams trace one business event across producer, broker, and consumer logs?
- What retention period is required to survive expected consumer outages?
- How much storage and network capacity will the event volume require?
- Which service-level objectives define acceptable latency and recovery?
- How will authentication, authorization, encryption, and data-at-rest protection be handled?
- Who approves event contract changes?
- How will disaster recovery be tested?
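The storage question above lends itself to a first-order estimate. The sketch below multiplies event rate, average event size, retention period, and replication factor. All input numbers are made up for illustration, and the result ignores compression, index files, and operational headroom.

```python
def retained_bytes(events_per_second: float, avg_event_bytes: int,
                   retention_days: int, replication_factor: int) -> float:
    """Rough upper bound on the disk used by one topic across the cluster."""
    seconds = retention_days * 24 * 60 * 60
    return events_per_second * avg_event_bytes * seconds * replication_factor


# Assumed workload: 200 events/s, 1,000 bytes each, 7 days, 3 replicas.
estimate = retained_bytes(200, 1_000, 7, 3)
estimate_gb = estimate / 1e9  # decimal gigabytes
```

With these inputs the estimate is about 363 GB. The retention period is the term to revisit first, because it must also cover the longest consumer outage the design is expected to survive.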
Deployment choice affects these responsibilities.
| Deployment model | Main advantage | Main responsibility or limitation |
|---|---|---|
| Self-managed Kafka | Greater control over versions, configuration, tools, and infrastructure | Your team owns provisioning, monitoring, maintenance, troubleshooting, and physical or cloud infrastructure |
| Managed Kafka service | Faster provisioning and reduced broker administration | Provider limits may affect versions, low-level tuning, ecosystem components, and management tools |
| Hybrid design | Can balance control, residency, and managed capabilities | Adds integration, networking, governance, and operational complexity |
The right choice depends on available skills, regulatory constraints, required control, cost, and the ability to operate a distributed platform over time.
Test the Architecture Before Expanding It
A proof of concept should test failure behavior, not only the successful path. Use a small but realistic workflow and verify the following scenarios.
Consumer outage
- Stop FulfillmentService.
- Publish several accepted orders.
- Restart the consumer.
- Verify that it resumes and processes the stored events.
Lost acknowledgment or producer retry
- Simulate a send whose result is uncertain.
- Allow the producer to retry.
- Verify that downstream processing does not create duplicate business effects.
Broker failure
- Run a multi-broker test cluster with replication.
- Make one broker unavailable.
- Confirm that the cluster continues serving data according to the selected reliability settings.
Consumer rebuild
- Remove the consumer's local projection.
- Replay retained events.
- Verify that the rebuilt state matches the expected current state.
- Confirm that non-repeatable side effects are not triggered again.
Contract evolution
- Introduce a compatible event version.
- Run old and new consumers at the same time.
- Verify that both can process the messages they are expected to understand.
Delayed and out-of-order business updates
- Delay a profile-change event while allowing a later order event to arrive.
- Observe how the consumer handles temporarily stale state.
- Confirm that the result follows the documented business policy.
These tests turn abstract claims about decoupling and reliability into observable system behavior.
Common Mistakes
Replacing every REST call with an event
Some interactions are queries or commands that require immediate outcomes. Forcing them into an asynchronous flow can make the user experience and error handling unnecessarily complex.
Assuming decoupling removes consistency problems
Event-driven systems trade synchronous coordination for eventual consistency, retry handling, duplicate processing, and possible ordering concerns. These are design responsibilities, not edge cases.
Letting producers depend on specific consumers
A producer should publish a reusable business fact. When its payload and publishing logic are shaped around one subscriber, direct coupling has merely moved into the event contract.
Expecting Kafka brokers to validate business payloads
Brokers store bytes and do not enforce domain rules. Use governed contracts, client-side validation, and a schema-management strategy.
Treating replay as harmless
Rebuilding state is different from repeating side effects. Consumers must distinguish replay-safe calculations from actions such as charging, emailing, or calling external systems.
Using Kafka as a drop-in replacement for a database
Kafka is valuable for durable event history and state reconstruction, but business queries usually require a projection, index, cache, or database designed for that access pattern.
Postponing operational design
A working demo does not answer who monitors lag, owns incidents, controls access, manages schemas, estimates storage, or restores service after a major failure.
Architecture Review Checklist
Before approving the migration, confirm that the design answers every item below:
- [ ] The first event represents a completed business fact with a clear owner.
- [ ] Interactions that require immediate answers remain synchronous where appropriate.
- [ ] Producers do not need to know which consumers are running.
- [ ] Consumers can tolerate temporary downtime and resume from retained events.
- [ ] Every consumer has a strategy for repeated delivery and repeated side effects.
- [ ] The acceptable window of eventual consistency is defined by the business.
- [ ] The handling of stale or unexpectedly ordered updates is documented.
- [ ] Event schemas are versioned, governed, and compatible with deployed consumers.
- [ ] Kafka Connect is considered for data replication without custom business logic.
- [ ] Stream processing is considered for shared routing, filtering, aggregation, or joins.
- [ ] Retention supports recovery and replay requirements.
- [ ] Local projections are used where services need efficient business queries.
- [ ] Monitoring covers brokers, producers, consumers, and end-to-end event flow.
- [ ] Security, deployment ownership, capacity, cost, and disaster recovery are planned.
- [ ] Failure scenarios are tested before more workflows are migrated.
Conclusion
Kafka can break a fragile chain of synchronous dependencies by allowing one service to publish a durable event and many independent consumers to react. Its broker cluster, persistent commit log, replay capability, and publish-subscribe model provide a strong foundation for event-driven systems.
The improvement is not automatic. The architecture must define which interactions remain synchronous, who owns each event, how contracts evolve, how consumers handle retries, what eventual consistency means to the business, where transformations belong, and how the platform will be operated.
Start with one meaningful event, one end-to-end flow, and a failure-focused proof of concept. Expanding Kafka after those decisions are explicit is far safer than placing a broker between services and discovering the real architecture only after production incidents begin.