Preventing a Kafka-First Customer Data Platform from Breaking Task Queues, Batch Loads, and Globally Ordered Workflows

A platform team is building a unified customer data system for an online retailer. The first proposal is attractive: send every important change to Kafka, connect every service to it, and use the same platform for database replication, centralized logs, fraud detection, background tasks, daily warehouse exports, and large document transfers.

The proposal reduces the number of technologies on the architecture diagram, but it creates a more dangerous problem: Kafka's strengths can hide its constraints. A design that works well for high-rate event streams may become awkward for one-worker task queues, strict global ordering, random data lookup, large files, or time-bounded batch jobs.

The goal is not to prove that Kafka is always the best choice or always unnecessary. The goal is to match each workload to the behavior it actually needs. This tutorial walks through an architecture review for a customer data platform and shows how to keep Kafka where its log-based, partitioned, publish-subscribe model provides real value.

The Platform Under Review

The retailer wants to combine information from several systems:

CustomerService owns customer profile data.
OrderService publishes purchases and order status changes.
PaymentService publishes successful and rejected payments.
SupportService records customer interactions.
Operational databases contain data that older applications cannot publish directly.
Every service produces logs and operational metrics.
A fraud component must react to suspicious transactions quickly.
A recommendation component needs recent browsing and purchase activity.
A data warehouse receives a daily bulk export.
Worker services execute email, document, and image-processing jobs.

The initial Kafka-first design looks like this:

CustomerService ---------+
OrderService ------------+
PaymentService ----------+----> Kafka ----> Customer data projection
SupportService ----------+       |  |
Legacy databases --------+       |  +----> Fraud detection
Application logs --------+       +-------> Logging platform
Worker tasks ------------+
Daily warehouse batches -+
Large documents ---------+

The diagram is simple because everything appears to use the same transport. The simplicity is deceptive. These workloads have different requirements for fan-out, ordering, routing, storage, failure handling, data size, latency, and access patterns.

The architecture review must answer one question for each workload:

Does Kafka's native operating model match the required behavior, or would the team need to build substantial logic around it?

Start with Kafka's Actual Operating Model

Kafka is a durable, partitioned event log. Producers append records to topics, and consumers pull records from partitions. Several independent consumer groups can read the same events without competing with one another.

Its essential characteristics are:

Publish-subscribe delivery
Durable storage on disk
Partition-based parallelism
Ordering inside one partition
Sequential consumption
Retention based on time, size, or compaction
Client-side processing and validation
High throughput for streams of relatively small records
Replay while retained data remains available

Kafka brokers deliberately avoid interpreting business payloads. They store bytes, replicate partitions, and serve records to clients. Content-based routing, validation, enrichment, aggregation, and business decisions belong in producers, consumers, connectors, or stream-processing applications.

That design is powerful when the workload is naturally a stream of immutable facts. It becomes less natural when the broker itself must choose one worker, inspect content, retrieve a record by business field, or treat a group of records as a closed batch.

Workload 1: Fan-Out Customer Changes to Independent Services

When a customer changes an address, several services may need the update:

Fulfillment needs the latest delivery address.
Billing may need invoice information.
Customer support needs a current profile.
Analytics needs the change for reporting.

This is a strong Kafka use case because one event can be retained and consumed independently by several groups.

CustomerService
      |
      | customer profile changed
      v
Kafka topic: customer-profile-events
      |
      +--------------------+--------------------+
      |                    |                    |
      v                    v                    v
Fulfillment group    Billing group      Support projection group

The customer identifier should be used as the record key when updates for one customer must remain ordered. The producer's default partitioner hashes the key, so records with that key land in the same partition as long as the topic's partition count does not change.
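The effect of keying can be sketched without a broker. The snippet below is a simplified model, not Kafka's partitioner: Kafka's Java client hashes the serialized key with murmur2, while this sketch substitutes `zlib.crc32`. The names `partition_for`, `append`, and `NUM_PARTITIONS` are hypothetical.

```python
import zlib

NUM_PARTITIONS = 6  # assumed partition count, for illustration only


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a record key to a partition with a stable hash (CRC32 stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def append(log: dict, key: str, value: str) -> int:
    """Append a record to the partition chosen by its key; return that partition."""
    partition = partition_for(key)
    log.setdefault(partition, []).append((key, value))
    return partition


log = {}  # partition number -> list of (key, value) records
p1 = append(log, "customer-42", "address changed to A")
p2 = append(log, "customer-7", "email changed")
p3 = append(log, "customer-42", "address changed to B")

# Both updates for customer-42 share one partition, in append order.
updates_for_42 = [value for key, value in log[p1] if key == "customer-42"]
```

Records for different customers may share a partition or not. Only the relative order of records with the same key is preserved.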

Each consumer group maintains its own progress. If the support projection is temporarily offline, billing can continue processing. When support returns, it can resume from its committed position while the records remain within the topic's retention window.
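The interaction between independent group progress and retention can be illustrated with a toy single-partition log. This is a sketch under stated assumptions, not Kafka client code: retention is modeled as a record count rather than time or size, the `Topic` class is hypothetical, and a group whose position has expired resumes from the oldest retained record (roughly what an `auto.offset.reset` policy of `earliest` would do).

```python
class Topic:
    """Toy single-partition log that keeps only the newest `retained` records."""

    def __init__(self, retained: int):
        self.retained = retained
        self.records = []
        self.first_offset = 0  # offset of the oldest record still retained
        self.committed = {}    # group name -> next offset to read

    def append(self, record: str) -> None:
        self.records.append(record)
        if len(self.records) > self.retained:
            self.records.pop(0)  # the oldest record expires
            self.first_offset += 1

    def poll(self, group: str) -> list:
        """Return the group's unread records and commit its new position."""
        position = max(self.committed.get(group, 0), self.first_offset)
        batch = self.records[position - self.first_offset:]
        self.committed[group] = self.first_offset + len(self.records)
        return batch


topic = Topic(retained=3)
topic.append("e0")
topic.append("e1")
billing_first = topic.poll("billing")    # billing keeps up
for event in ("e2", "e3", "e4"):         # support is offline meanwhile
    topic.append(event)
billing_second = topic.poll("billing")
support_batch = topic.poll("support")    # e0 and e1 have already expired
```

Billing sees every record, while support silently misses the two that expired during its outage. That is why the retention window has to be sized against the longest outage a consumer is expected to survive.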

Why Kafka fits

Kafka matches the requirement because:

One event must reach multiple independent subscribers.
Subscribers should be able to fail or scale independently.
Events need durable buffering.
Per-customer ordering is sufficient.
New consumers may need to replay retained history.

The decision to make about state

The team must decide whether the topic represents a temporary event stream or a rebuildable latest-state log.

With normal time-based retention, downstream services need their own persistent storage. Once old records expire, Kafka no longer contains the full history required to reconstruct state.

With compaction, Kafka retains at least the latest value for each key. A consumer can rebuild a current customer projection by reading the compacted topic from the beginning. However, access remains sequential. Compaction does not turn Kafka into a database that can instantly fetch one customer by identifier.
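A compacted topic can be pictured as a log from which older values for each key are eventually removed. The sketch below models that idea in memory. The function names are hypothetical, and real compaction runs in the background on the broker, so a reader may still see some older values for a key.

```python
from typing import Optional


def compact(log: list) -> list:
    """Keep only the last record for each key, preserving log order."""
    last_index = {key: i for i, (key, _) in enumerate(log)}
    return [record for i, record in enumerate(log) if last_index[record[0]] == i]


def rebuild_projection(log: list) -> dict:
    """Replay the log sequentially into a key -> latest value mapping."""
    projection = {}
    for key, value in log:
        if value is None:  # a null value (tombstone) marks a deleted key
            projection.pop(key, None)
        else:
            projection[key] = value
    return projection


log = [
    ("c1", {"city": "Oslo"}),
    ("c2", {"city": "Rome"}),
    ("c1", {"city": "Bergen"}),
    ("c2", None),
]
compacted = compact(log)
current_state = rebuild_projection(compacted)
```

The rebuilt state is the same whether the full or the compacted log is replayed, but in both cases the consumer reads from the beginning. Looking up one customer still requires the consumer's own indexed store.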

A practical design is:

Kafka:
  durable event transport and replay

Consumer database:
  indexed customer projection for API queries

Kafka distributes changes. A database serves random business queries.

Workload 2: Replicate Legacy Database Changes

Some older systems cannot publish business events. Their databases still contain information needed by the customer platform.

Kafka Connect can run source connectors that read from external systems and publish records into Kafka. Sink connectors can read Kafka topics and write into databases, object storage, or analytical systems.

Legacy database
      |
      | source connector
      v
Kafka Connect
      |
      v
Kafka topics
      |
      +---------------------+
      |                     |
      | sink connector      | sink connector
      v                     v
Customer projection    Object storage

This is useful when the integration is mostly continuous data movement rather than custom business processing.

Where Kafka Connect helps

Kafka Connect can reduce repeated integration code by handling:

Source polling
Kafka production
Kafka consumption
Common data conversion
Connector execution and scaling
Continuous transfer to target systems

It is a good candidate when several systems need the same replicated data and Kafka already acts as the central distribution layer.

Where the apparent simplicity ends

A connector usually exposes lower-level source changes, not necessarily meaningful domain events. A database row update may tell consumers which columns changed, but it may not explain why the business operation happened.

The platform team must also account for:

An additional Kafka Connect cluster to operate
Connector licensing and support costs
Source and target outages
Error recovery
Sensitive data exposure
Table or schema changes
Weaknesses in complex transformation support

Basic field filtering or masking can fit inside a connector pipeline. Joining several records, maintaining state, or applying advanced business rules usually belongs in a stream-processing application or a dedicated service.

The architecture should separate data movement from business interpretation:

Database change
      |
      v
Source connector
      |
      v
Raw change topic
      |
      v
Transformation or enrichment service
      |
      v
Business event topic

This extra step prevents every consumer from coupling directly to the physical structure of a legacy table.
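The transformation step can be as small as one function that turns a row-level change into a domain event. The sketch below assumes a hypothetical raw record shape with `table`, `op`, `before`, and `after` fields and a hypothetical `CustomerAddressChanged` event; actual connector output formats differ by connector and configuration.

```python
from typing import Optional

ADDRESS_COLUMNS = ("street", "city", "postal_code")  # assumed column names


def to_business_event(change: dict) -> Optional[dict]:
    """Translate a raw row change into a domain event, or None if irrelevant."""
    if change.get("table") != "customers" or change.get("op") != "update":
        return None
    before, after = change["before"], change["after"]
    if all(before.get(col) == after.get(col) for col in ADDRESS_COLUMNS):
        return None  # no address column changed
    # Only the named columns are copied, so other columns stay out of the event.
    return {
        "type": "CustomerAddressChanged",
        "customerId": after["id"],
        "address": {col: after.get(col) for col in ADDRESS_COLUMNS},
    }
```

Consumers of the business-event topic then depend on the event contract, not on the column layout of the legacy table.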

Workload 3: Buffer High-Volume Application Logs

A distributed application produces logs from many services and instances. Sending each service directly to the final indexing system makes applications dependent on that system's availability and ingestion capacity.

Kafka can act as a durable buffer:

Applications
     |
     v
Log collectors
     |
     v
Dedicated Kafka log topics
     |
     v
Kafka Connect sink
     |
     v
Elasticsearch or another log index
     |
     v
Search and dashboards

Applications can write through standard logging frameworks while agents such as Fluentd collect and forward log files. This keeps Kafka-specific producer logic out of business code.

Kafka helps because it can absorb traffic bursts and allow the indexing system to catch up after an interruption. It also decouples the application fleet from a particular log-analysis product.

Why logging may need a separate cluster

Log traffic has different priorities from business events.

Business events may prioritize durability and low latency because losing a payment or inventory update is unacceptable. Logging commonly prioritizes sustained throughput and large-volume buffering. The tolerance for losing a small amount of diagnostic data may also differ from the tolerance for losing financial data.

Placing both workloads on one cluster can allow a log spike to compete with critical event streams for disk, network, memory, and broker capacity.

A dedicated logging cluster provides:

Resource isolation
Independent capacity planning
Retention policies suited to logs
Tuning focused on throughput
Reduced risk to business-event latency

For a small application with modest log volume, Kafka may be excessive. Direct collection agents or a simpler centralized logging path can be easier to operate.

Workload 4: Detect Fraud and Produce Real-Time Decisions

A fraud detector consumes transaction events, evaluates them, and publishes suspicious results.

Payment events
      |
      v
Kafka topic: payment-events
      |
      v
FraudDetectionService
      |
      | filters, enriches, evaluates
      v
Kafka topic: suspicious-payments
      |
      v
Alerting and investigation systems

The fraud application is both a consumer and a producer. Kafka transports and stores the records, but it does not perform the fraud calculation inside the broker.

Kafka Streams, Apache Flink, or another processing framework can provide higher-level operations for filtering, transforming, aggregating, joining, and maintaining state over event streams.
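To make the division of labor concrete, the following sketch shows the kind of logic that lives in the processing layer and not in the broker. It is a plain in-memory illustration, not Kafka Streams or Flink code. The two rules, the thresholds, and the field names (`cardId`, `amount`, `timestamp`, `paymentId`) are assumptions, and events are assumed to arrive in timestamp order for each card.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60     # assumed sliding-window length
MAX_PAYMENTS = 3        # assumed per-card limit inside the window
AMOUNT_LIMIT = 5_000.0  # assumed single-payment threshold


class FraudDetector:
    def __init__(self):
        self.recent = defaultdict(deque)  # card id -> timestamps inside the window

    def evaluate(self, payment: dict):
        """Return a suspicious-payment record, or None if the payment looks normal."""
        window = self.recent[payment["cardId"]]
        window.append(payment["timestamp"])
        while window and window[0] <= payment["timestamp"] - WINDOW_SECONDS:
            window.popleft()  # drop timestamps that left the window
        reasons = []
        if payment["amount"] > AMOUNT_LIMIT:
            reasons.append("amount over limit")
        if len(window) > MAX_PAYMENTS:
            reasons.append("too many payments in window")
        if not reasons:
            return None
        return {"paymentId": payment["paymentId"], "reasons": reasons}


def process(payment_events) -> list:
    """Consume payment events in order and collect the suspicious results."""
    detector = FraudDetector()
    results = (detector.evaluate(p) for p in payment_events)
    return [r for r in results if r is not None]
```

The per-card window is state. In this sketch it disappears when the process stops, which is exactly the recovery problem that stream-processing frameworks address with managed state.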

Why Kafka fits

This workload benefits from:

Low-delay reaction to new data
Replay after a logic correction
Independent downstream consumers
Partition-based scaling
Durable input and output topics
A clear separation between transport and processing

What Kafka does not provide by itself

Kafka alone does not:

Inspect transaction content
Decide whether a transaction is fraudulent
Join transaction data with customer history
Maintain analytical state
Route messages according to business fields

Those responsibilities belong in the processing layer.

Real-time processing also adds costs:

Developers need streaming-specific knowledge.
Stateful processing requires recovery planning.
Distributed debugging is harder than debugging a local batch.
Tuning memory, disk, and parallelism takes testing.
Tooling may be less mature than conventional application tooling.
Not every language ecosystem has equal framework support.

Real-time processing is justified only when the decision loses value if delayed. A daily marketing report does not need the same architecture as fraud prevention.

Workload 5: Assign One Job to One Worker

The same platform also needs background tasks such as sending one email, resizing one image, or generating one document. Each task should be completed by one available worker.

This is naturally a point-to-point queue problem.

Kafka uses a publish-subscribe model. Consumer groups can divide partitions among workers, but the unit of assignment is the partition, not the individual message. Building exact worker assignment, broker-side routing, per-message acknowledgment, or priority handling on top of that model can require awkward application logic.
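One consequence of partition-oriented assignment is easy to show. In a consumer group, each partition is owned by at most one group member, so workers beyond the partition count receive nothing. The sketch below uses a simplified round-robin assignment that is not one of Kafka's actual assignors; `assign_partitions` is a hypothetical name.

```python
def assign_partitions(partitions: int, workers: list) -> dict:
    """Round-robin partitions over workers; each partition has exactly one owner."""
    assignment = {worker: [] for worker in workers}
    for partition in range(partitions):
        owner = workers[partition % len(workers)]
        assignment[owner].append(partition)
    return assignment


assignment = assign_partitions(3, ["w1", "w2", "w3", "w4", "w5"])
idle_workers = [worker for worker, owned in assignment.items() if not owned]
```

With three partitions and five workers, two workers stay idle regardless of how many tasks are waiting. A slow task also delays the tasks queued behind it in the same partition, because an idle worker cannot take them over.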

RabbitMQ is often a better candidate when the core requirement is:

One message should go to one worker.
The broker should route messages using rules.
Different queues represent different work destinations.
Complex routing or task distribution matters more than replayable history.

RabbitMQ uses exchanges, bindings, and queues:

Producer
   |
   v
Exchange
   |
   +---- binding ----> Email queue ----> Email worker
   |
   +---- binding ----> Image queue ----> Image worker
   |
   +---- binding ----> Document queue -> Document worker

The broker can route according to configured rules. This is a different philosophy from Kafka's, which pairs smart clients with a simple broker data path.
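The routing model can be mimicked in a few lines to show where the decision is made. This is a toy, not RabbitMQ client code: the `DirectExchange` class is hypothetical, it models only exact routing-key matches, and it omits acknowledgments, redelivery, and persistence.

```python
from collections import defaultdict, deque


class DirectExchange:
    """Toy broker-side router: a routing key selects the bound queues."""

    def __init__(self):
        self.bindings = defaultdict(list)  # routing key -> queue names
        self.queues = defaultdict(deque)   # queue name -> pending messages

    def bind(self, routing_key: str, queue: str) -> None:
        self.bindings[routing_key].append(queue)

    def publish(self, routing_key: str, message: dict) -> None:
        # The router, not the producer or consumer, picks the destination.
        for queue in self.bindings.get(routing_key, []):
            self.queues[queue].append(message)

    def consume(self, queue: str):
        """Hand the next message to exactly one caller and remove it."""
        pending = self.queues[queue]
        return pending.popleft() if pending else None


exchange = DirectExchange()
exchange.bind("email", "email-queue")
exchange.bind("image", "image-queue")
exchange.publish("email", {"to": "a@example.com"})
first_worker = exchange.consume("email-queue")
second_worker = exchange.consume("email-queue")  # nothing left to take
```

Once one worker takes the message it is gone from the queue, which is the point-to-point behavior the task workload needs and the opposite of a retained, replayable log.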

Kafka can support worker-style consumption in some designs, but the team should not choose it merely because Kafka is already available. The operational convenience of one platform does not automatically compensate for a mismatch in messaging semantics.

Workload 6: Upload a Daily Data Warehouse Batch

The warehouse team wants every day's transactions collected and uploaded in one bulk operation.

Kafka can supply the source records, but it does not naturally define a closed daily batch. Events form a continuous stream, can arrive late, and are split across partitions.

A consumer building a daily batch must answer:

How are records gathered from every partition?
Which event time defines the day?
When is the batch considered complete?
How are late events handled?
How are partial upload failures retried?
How are only failed records selected when Kafka is read sequentially?

A conceptual implementation becomes more complicated than the requirement suggests:

Continuous transaction stream
      |
      v
Collect records across partitions
      |
      v
Apply time boundary
      |
      v
Wait for late events or close batch
      |
      v
Bulk upload
      |
      +---- partial failures require custom recovery
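The boundary questions above become explicit as soon as the batching logic is written down. The sketch below is one possible policy, not a recommended one: the day is defined by event time in UTC, a day is closed after a fixed grace period, and `ALLOWED_LATENESS`, `eventTime`, and the function names are assumptions.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

ALLOWED_LATENESS = timedelta(hours=2)  # assumed grace period for late events


def event_day(event: dict) -> str:
    """The batch day is defined by the event's own time, in UTC."""
    moment = datetime.fromtimestamp(event["eventTime"], tz=timezone.utc)
    return moment.date().isoformat()


def build_batches(events: list, now: datetime) -> dict:
    """Group events by event-time day and return only the days that may be closed."""
    batches = defaultdict(list)
    for event in events:  # events may come from any partition, in any order
        batches[event_day(event)].append(event)
    closed = {}
    for day, records in batches.items():
        day_start = datetime.fromisoformat(day).replace(tzinfo=timezone.utc)
        day_end = day_start + timedelta(days=1)
        if now >= day_end + ALLOWED_LATENESS:
            closed[day] = sorted(records, key=lambda e: e["eventTime"])
    return closed
```

An event that arrives after its day has been closed and uploaded still needs a policy of its own, such as a correction batch. Neither this sketch nor Kafka decides that for the team.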

If the business genuinely needs a once-per-day transfer and does not require immediate processing, a conventional extract-transform-load process may be simpler and less expensive.

Kafka can still feed the warehouse, especially when the same events already support real-time consumers. The mistake is treating Kafka as proof that the warehouse load itself must become a streaming application.

Use batch tools when the workload is fundamentally batch-oriented. Use streaming when incremental results provide business value.

Workload 7: Send Large Documents Through Kafka

The retailer wants to transfer invoices, images, and generated reports between services. These files may be much larger than ordinary event records.

Kafka is optimized for streams of relatively small messages. Large records consume producer memory, consumer memory, network bandwidth, and replication capacity. Increasing size limits also requires coordinated changes across brokers (message.max.bytes), topics (max.message.bytes), producers (max.request.size), consumers (max.partition.fetch.bytes), and replica fetching (replica.fetch.max.bytes).

The preferred architecture stores the content elsewhere and sends a reference:

Producer
   |
   | store document
   v
Object or file storage
   |
   | return document reference
   v
Kafka event containing:
  - document identifier
  - storage reference
  - type
  - checksum or relevant metadata
   |
   v
Consumer retrieves document from storage

This pattern keeps Kafka records small and leaves large binary storage to a system designed for it.
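A minimal sketch of the reference pattern follows. A dictionary stands in for object storage, and the event field names (`documentId`, `storageReference`, `sha256`, `sizeBytes`) and function names are assumptions chosen for illustration.

```python
import hashlib
import uuid

object_store = {}  # stand-in for object or file storage: reference -> bytes


def publish_document(content: bytes, doc_type: str) -> dict:
    """Store the content externally and return the small event to publish."""
    document_id = str(uuid.uuid4())
    reference = f"documents/{document_id}"
    object_store[reference] = content
    return {
        "documentId": document_id,
        "storageReference": reference,
        "type": doc_type,
        "sha256": hashlib.sha256(content).hexdigest(),
        "sizeBytes": len(content),
    }


def fetch_document(event: dict) -> bytes:
    """Resolve the reference and verify the content against the checksum."""
    content = object_store[event["storageReference"]]  # KeyError if already deleted
    if hashlib.sha256(content).hexdigest() != event["sha256"]:
        raise ValueError("checksum mismatch for " + event["documentId"])
    return content
```

The event stays a few hundred bytes regardless of document size, and the checksum lets a consumer detect a file that was replaced or corrupted after the event was published.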

The tradeoff is lifecycle coordination. If the event expires but the file remains, storage can accumulate orphaned objects. If the file is deleted too early, a consumer replay cannot retrieve it. Retention and cleanup must therefore be designed across both systems.

Splitting one large message into fragments is possible, but it creates more complexity. All fragments must remain together, ordering must be preserved, failed consumers must recover partially received data, and the application must reassemble the content correctly.

Use fragmentation only when the reference pattern is impossible and the added failure modes are acceptable.

Workload 8: Enforce One Global Order

The business plans a limited-stock sale and asks for every purchase request to be processed in one exact order across all regions.

Kafka guarantees ordering inside a partition. It does not provide one global order across several partitions.

A single-partition topic creates one sequence:

Partition 0:
request 1 -> request 2 -> request 3 -> request 4

But it also removes partition-level parallelism and limits scalability for that workflow.

Several producers introduce another issue. Kafka preserves the order in which records are appended to a partition, which is broker arrival order and not creation order. Network delay can cause a request created earlier to arrive after a later request from another producer.

Producer A creates event A first
Producer B creates event B later

Network delay:
B reaches Kafka first
A reaches Kafka later

Kafka order:
B -> A

If business fairness depends on a globally authoritative sequence, the system needs a deliberate sequencing design. Possible approaches include:

One serialized ingress path
A single-partition topic with accepted throughput limits
Explicit sequence metadata assigned by an authoritative component
A downstream reordering stage based on defined event-time rules

Sequence numbers can help consumers reconstruct order, but they do not make Kafka globally order events at ingestion. The ordering authority must exist somewhere in the architecture.
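The combination of an authoritative sequencer and a downstream reordering stage can be sketched as follows. Both pieces are hypothetical illustrations. In particular, the sequencer is a single in-process counter, which sidesteps the hard part of making that authority available and consistent across regions.

```python
import itertools


class Sequencer:
    """Single authority that stamps each accepted request with the next number."""

    def __init__(self):
        self._numbers = itertools.count(1)

    def stamp(self, request: dict) -> dict:
        return {**request, "sequence": next(self._numbers)}


def reorder(arrivals):
    """Yield records in sequence order, holding back any that arrive early."""
    pending, expected = {}, 1
    for record in arrivals:
        pending[record["sequence"]] = record
        while expected in pending:
            yield pending.pop(expected)
            expected += 1


sequencer = Sequencer()
a = sequencer.stamp({"requestId": "A"})  # accepted first
b = sequencer.stamp({"requestId": "B"})  # accepted second
in_order = [r["requestId"] for r in reorder([b, a])]  # B reached the log first
```

The reordering stage waits indefinitely for a missing number, so a practical design also needs a rule for gaps, such as a timeout or an explicit cancellation record.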

Workload 9: Query Events by Business Field

A support API needs to retrieve all activity for one customer, filter purchases by date, and search by email address.

Kafka consumers read partitions sequentially. They can seek using offsets or timestamps, but Kafka does not index arbitrary payload fields.

The following access pattern is a poor fit for Kafka as primary query storage:

Find all events where:
  customerEmail = a specific value
  purchaseAmount exceeds a threshold
  supportCategory matches a phrase

A consumer would have to read and deserialize records until it found matching content.

The correct design projects events into an indexed datastore:

Kafka event stream
      |
      v
Projection consumer
      |
      v
Relational, document, or search database
      |
      v
Customer support API

Kafka remains the event backbone. The database provides random access, filtering, indexing, and query performance.
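A projection consumer can be sketched with an in-memory SQLite database standing in for the real datastore. The table layout, the index, and the event field names (`customerEmail`, `kind`, `amount`, `occurredOn`) are assumptions, and the events are passed in directly instead of being read from a topic.

```python
import sqlite3


def create_projection() -> sqlite3.Connection:
    """Create the query-side table with an index on the fields the API filters by."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE activity ("
        "customer_email TEXT, kind TEXT, amount REAL, occurred_on TEXT)"
    )
    db.execute("CREATE INDEX idx_activity_email ON activity (customer_email, occurred_on)")
    return db


def project(db: sqlite3.Connection, event: dict) -> None:
    """Apply one consumed event to the projection."""
    db.execute(
        "INSERT INTO activity VALUES (?, ?, ?, ?)",
        (event["customerEmail"], event["kind"], event.get("amount"), event["occurredOn"]),
    )


def purchases_for(db: sqlite3.Connection, email: str, since: str, min_amount: float) -> list:
    """Indexed lookup that would require a full sequential scan against a topic."""
    rows = db.execute(
        "SELECT occurred_on, amount FROM activity "
        "WHERE customer_email = ? AND kind = 'purchase' "
        "AND occurred_on >= ? AND amount > ? ORDER BY occurred_on",
        (email, since, min_amount),
    )
    return rows.fetchall()
```

This version inserts blindly, so replaying the topic would duplicate rows. A production projection would typically key each row on an event identifier so that reprocessing is idempotent.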

Compare Kafka, RabbitMQ, Pulsar, and Managed Services

No comparison table can replace testing, but it can expose architectural mismatches early.

Requirement	Kafka	RabbitMQ	Apache Pulsar	Managed event service
Replayable event history	Strong fit	Depends on queue or stream model	Strong fit	Product-dependent
High-rate publish-subscribe streams	Strong fit	Possible, but traditional queues have a different focus	Strong fit	Often strong
Complex broker-side routing	Limited	Strong fit through exchanges and bindings	Depends on design and features	Product-dependent
One-task-to-one-worker queue	Possible but not the natural model	Strong fit	Supported through subscription models	Product-dependent
Separate compute and storage scaling	Traditional brokers combine both	Not the primary model	Core architectural feature	Often abstracted by provider
Low operational responsibility	Requires self-management unless managed	Requires self-management unless managed	Requires self-management unless managed	Strong fit
Kafka client compatibility	Native	Different protocol and clients	Different clients and APIs	Some services support Kafka protocol

Apache Pulsar is architecturally close to Kafka. It uses partitioned topics and publish-subscribe communication, but separates stateless brokers from persistent storage managed by Apache BookKeeper. This can make compute and storage scaling more independent.

Kafka brokers traditionally handle both client communication and partition storage. That difference affects recovery, resource management, and operational design.

Managed cloud services can reduce provisioning and cluster-maintenance work. The tradeoffs are less control, platform-specific behavior, provider pricing models and quotas, and possible ecosystem differences. Protocol compatibility does not guarantee identical semantics.

A Decision Workflow for New Use Cases

The platform team can use a small architectural decision process before creating a topic.

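from dataclasses import dataclass


@dataclass
class Workload:
    requires_content_indexed_queries: bool = False
    contains_large_binary_payloads: bool = False
    requires_one_message_for_one_worker: bool = False
    requires_strict_global_order: bool = False
    is_scheduled_batch_without_real_time_value: bool = False
    requires_replayable_fan_out_or_continuous_streaming: bool = False


def choose_platform(workload: Workload) -> str:
    # Mismatches are checked before the Kafka fit; the first match wins.
    if workload.requires_content_indexed_queries:
        return "Database or search datastore"

    if workload.contains_large_binary_payloads:
        return "External object storage plus event reference"

    if workload.requires_one_message_for_one_worker:
        return "Queue-oriented broker"

    if workload.requires_strict_global_order:
        return "Central sequencing design, possibly one partition"

    if workload.is_scheduled_batch_without_real_time_value:
        return "ETL or batch-processing tool"

    if workload.requires_replayable_fan_out_or_continuous_streaming:
        return "Evaluate Kafka or Pulsar"

    return "Compare the simplest viable options"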

The result is not an automatic technology selection. It is a prompt to examine the requirement before reusing the existing platform.

For workloads that remain Kafka candidates, continue with these questions:

How many independent consumer groups need each event?
Which entity defines the ordering key?
What throughput and latency are required?
How long must consumers be able to recover and replay?
Is compaction required for current state?
Where will searchable projections be stored?
Does processing require joins, windows, or state?
Are message sizes compatible with Kafka?
Can the team operate brokers, connectors, and processing applications?
Would a managed service reduce risk more than it reduces control?

The Revised Architecture

After the review, the platform uses different tools for different jobs:

Business microservices
      |
      v
Kafka business-event cluster
      |
      +----> Customer projection database
      +----> Fraud stream processor
      +----> Recommendation processor
      +----> Analytical consumers

Legacy databases
      |
      v
Kafka Connect
      |
      v
Raw integration topics
      |
      v
Transformation services
      |
      v
Business-event topics

Application logs
      |
      v
Log collectors
      |
      v
Separate Kafka logging cluster
      |
      v
Search and dashboard platform

Background tasks
      |
      v
Queue-oriented broker
      |
      v
Worker pools

Large documents
      |
      v
Object storage
      |
      v
Kafka carries references only

Daily warehouse workload
      |
      v
Batch or ETL pipeline

The architecture contains more than one product, but each component has a clear responsibility. Kafka remains central without becoming a forced solution for every communication problem.

Common Kafka-First Mistakes

Choosing Kafka because it is already deployed

Existing expertise and infrastructure are valid factors, but they should not override a major semantic mismatch.

Treating every asynchronous operation as event streaming

A durable business event, a background command, a log line, and a daily batch record are all asynchronous data, but they do not require the same behavior.

Expecting brokers to route or validate business content

Kafka brokers do not deserialize payloads. Routing, validation, and transformation require clients or processing applications.

Using Kafka as a searchable database

Retention and replay do not provide indexes on business fields. Build a projection in an appropriate datastore.

Ignoring retention when consumers can be offline

Kafka can delete records after retention expires even when a consumer has not processed them. Recovery objectives must determine retention.

Sending large files as ordinary events

Large payloads affect producers, brokers, replicas, networks, and consumers. Store the content externally and publish a reference when possible.

Assuming partitions provide global order

Partitions provide parallelism and local order. Global order requires sacrificing parallelism or introducing an explicit sequencing mechanism.

Adding real-time processing where batch results are sufficient

Streaming adds operational and development complexity. Use it when low-delay results create real value.

Mixing log traffic with critical business events without isolation

Log volume can compete with business workloads. Separate clusters may be appropriate because reliability, retention, throughput, and latency goals differ.

Architecture Review Checklist

Before approving Kafka for a workload, confirm:

The data is naturally represented as an append-only event stream.
Multiple consumers need independent access or replay.
Partition-based parallelism matches the processing model.
Per-key ordering is sufficient.
The payloads are reasonably small.
Sequential access is acceptable.
Searchable state will be projected into another datastore.
Retention supports the maximum consumer outage.
Processing logic belongs outside the broker.
Batch boundaries are not being forced onto a continuous stream.
The workload does not require complex broker-side routing.
Operational costs for Kafka, Connect, and stream processing are justified.
Schema changes and sensitive data exposure have a governance plan.
Alternatives have been compared using functional and nonfunctional requirements.

Conclusion

Kafka is an excellent foundation for replayable event distribution, high-rate data integration, log buffering, and real-time processing. Its value comes from a deliberate architecture: durable partitions, independent consumer groups, sequential reads, and simple brokers that leave business logic to clients.

Those same characteristics create clear boundaries. Kafka is not a content-indexed database, a large-file store, a naturally global queue, or a batch scheduler. It can be adapted to some of those jobs, but the surrounding code and operational complexity may erase the benefit of platform reuse.

A successful Kafka architecture does not maximize the number of Kafka topics. It identifies the workloads that genuinely benefit from event streaming, gives them clear contracts and recovery policies, and chooses simpler or more specialized tools for everything else.
