A platform team is building a unified customer data system for an online retailer. The first proposal is attractive: send every important change to Kafka, connect every service to it, and use the same platform for database replication, centralized logs, fraud detection, background tasks, daily warehouse exports, and large document transfers.
The proposal reduces the number of technologies on the architecture diagram, but it hides a more serious problem: Kafka's strengths can mask its constraints. A design that works well for high-rate event streams may become awkward for one-worker task queues, strict global ordering, random data lookup, large files, or time-bounded batch jobs.
The goal is not to prove that Kafka is always the best choice or always unnecessary. The goal is to match each workload to the behavior it actually needs. This tutorial walks through an architecture review for a customer data platform and shows how to keep Kafka where its log-based, partitioned, publish-subscribe model provides real value.
The Platform Under Review
The retailer wants to combine information from several systems:
- CustomerService owns customer profile data.
- OrderService publishes purchases and order status changes.
- PaymentService publishes successful and rejected payments.
- SupportService records customer interactions.
- Operational databases contain data that older applications cannot publish directly.
- Every service produces logs and operational metrics.
- A fraud component must react to suspicious transactions quickly.
- A recommendation component needs recent browsing and purchase activity.
- A data warehouse receives a daily bulk export.
- Worker services execute email, document, and image-processing jobs.
The initial Kafka-first design looks like this:
CustomerService ---------+
OrderService ------------+
PaymentService ----------+----> Kafka ----> Customer data projection
SupportService ----------+       |  |
Legacy databases --------+       |  +----> Fraud detection
Application logs --------+       +-------> Logging platform
Worker tasks ------------+
Daily warehouse batches -+
Large documents ---------+
The diagram is simple because everything appears to use the same transport. The simplicity is deceptive. These workloads have different requirements for fan-out, ordering, routing, storage, failure handling, data size, latency, and access patterns.
The architecture review must answer one question for each workload:
Does Kafka's native operating model match the required behavior, or would the team need to build substantial logic around it?
Start with Kafka's Actual Operating Model
Kafka is a durable, partitioned event log. Producers append records to topics, and consumers pull records from partitions. Several independent consumer groups can read the same events without competing with one another.
Its essential characteristics are:
- Publish-subscribe delivery
- Durable storage on disk
- Partition-based parallelism
- Ordering inside one partition
- Sequential consumption
- Retention based on time, size, or compaction
- Client-side processing and validation
- High throughput for streams of relatively small records
- Replay while retained data remains available
Kafka brokers deliberately avoid interpreting business payloads. They store bytes, replicate partitions, and serve records to clients. Content-based routing, validation, enrichment, aggregation, and business decisions belong in producers, consumers, connectors, or stream-processing applications.
That design is powerful when the workload is naturally a stream of immutable facts. It becomes less natural when the broker itself must choose one worker, inspect content, retrieve a record by business field, or treat a group of records as a closed batch.
Workload 1: Fan-Out Customer Changes to Independent Services
When a customer changes an address, several services may need the update:
- Fulfillment needs the latest delivery address.
- Billing may need invoice information.
- Customer support needs a current profile.
- Analytics needs the change for reporting.
This is a strong Kafka use case because one event can be retained and consumed independently by several groups.
              CustomerService
                     |
                     | customer profile changed
                     v
   Kafka topic: customer-profile-events
                     |
+--------------------+--------------------+
|                    |                    |
v                    v                    v
Fulfillment group    Billing group        Support projection group
The customer identifier should be used as the record key when updates for one customer must remain ordered. The producer's default partitioner hashes the key, so records with the same key go to the same partition as long as the topic's partition count does not change.
Each consumer group maintains its own progress. If the support projection is temporarily offline, billing can continue processing. When support returns, it can resume from its committed position while the records remain within the topic's retention window.
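The two ideas above, key-based partition routing and independent group progress, can be sketched with a small in-memory model. This is not the Kafka client API: `TinyLog`, `partition_for`, and the CRC32 hash are illustrative assumptions (Kafka's default partitioner uses a murmur2 hash of the key).

```python
import zlib
from collections import defaultdict


class TinyLog:
    """Toy partitioned log with per-group offsets (hypothetical, not the Kafka API)."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]
        # offsets[group][partition] -> next position that group will read
        self.offsets = defaultdict(lambda: defaultdict(int))

    def partition_for(self, key):
        # A stable hash of the key: the same key always maps to the same partition.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def append(self, key, value):
        partition = self.partition_for(key)
        self.partitions[partition].append((key, value))
        return partition

    def poll(self, group, partition):
        """Return the records this group has not read yet and advance only its offset."""
        start = self.offsets[group][partition]
        records = self.partitions[partition][start:]
        self.offsets[group][partition] = len(self.partitions[partition])
        return records


log = TinyLog(num_partitions=3)
p1 = log.append("customer-42", "address: 1 Main St")
p2 = log.append("customer-42", "address: 9 Oak Ave")

billing = log.poll("billing", p1)   # billing keeps up with the stream
log.append("customer-42", "address: 5 Elm Rd")
support = log.poll("support", p1)   # support was offline and now catches up
```

Both appends land in the same partition, so the two address changes stay in order. The `support` group reads all three records when it returns, while `billing` only has the third one left to read.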
Why Kafka fits
Kafka matches the requirement because:
- One event must reach multiple independent subscribers.
- Subscribers should be able to fail or scale independently.
- Events need durable buffering.
- Per-customer ordering is sufficient.
- New consumers may need to replay retained history.
The decision to make about state
The team must decide whether the topic represents a temporary event stream or a rebuildable latest-state log.
With normal time-based retention, downstream services need their own persistent storage. Once old records expire, Kafka no longer contains the full history required to reconstruct state.
With compaction, Kafka retains at least the latest value for each key. A consumer can rebuild a current customer projection by reading the compacted topic from the beginning. However, access remains sequential. Compaction does not turn Kafka into a database that can instantly fetch one customer by identifier.
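A minimal sketch of what compaction preserves, using a plain list of `(key, value)` pairs as a stand-in for one partition. The function names are hypothetical, and the model is simplified: real compaction runs in the background on older log segments and also handles deletion markers.

```python
def compact(log):
    """Keep only the latest record for each key, preserving the order of survivors."""
    last_index = {}
    for index, (key, _value) in enumerate(log):
        last_index[key] = index
    return [record for index, record in enumerate(log) if last_index[record[0]] == index]


def rebuild_projection(compacted_log):
    """Read the log sequentially into a dict; the dict provides keyed lookup, not the log."""
    state = {}
    for key, value in compacted_log:
        state[key] = value
    return state


log = [
    ("c1", "Berlin"),
    ("c2", "Paris"),
    ("c1", "Munich"),
    ("c3", "Rome"),
    ("c2", "Lyon"),
]
compacted = compact(log)
projection = rebuild_projection(compacted)
```

Finding the current value for `c2` in `compacted` still means scanning the list. Only the rebuilt `projection` answers that question directly, which is why the consumer keeps its own indexed store.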
A practical design is:
Kafka:
    durable event transport and replay
Consumer database:
    indexed customer projection for API queries
Kafka distributes changes. A database serves random business queries.
Workload 2: Replicate Legacy Database Changes
Some older systems cannot publish business events. Their databases still contain information needed by the customer platform.
Kafka Connect can run source connectors that read from external systems and publish records into Kafka. Sink connectors can read Kafka topics and write into databases, object storage, or analytical systems.
Legacy database
|
| source connector
v
Kafka Connect
|
v
Kafka topics
|
+---------------------+
|                     |
| sink connector      | sink connector
v                     v
Customer projection   Object storage
This is useful when the integration is mostly continuous data movement rather than custom business processing.
Where Kafka Connect helps
Kafka Connect can reduce repeated integration code by handling:
- Source polling
- Kafka production
- Kafka consumption
- Common data conversion
- Connector execution and scaling
- Continuous transfer to target systems
It is a good candidate when several systems need the same replicated data and Kafka already acts as the central distribution layer.
Where the apparent simplicity ends
A connector usually exposes lower-level source changes, not necessarily meaningful domain events. A database row update may tell consumers which columns changed, but it may not explain why the business operation happened.
The platform team must also account for:
- An additional Kafka Connect cluster to operate
- Connector licensing and support costs
- Source and target outages
- Error recovery
- Sensitive data exposure
- Table or schema changes
- Limited support for complex transformations
Basic field filtering or masking can fit inside a connector pipeline. Joining several records, maintaining state, or applying advanced business rules usually belongs in a stream-processing application or a dedicated service.
The architecture should separate data movement from business interpretation:
Database change
|
v
Source connector
|
v
Raw change topic
|
v
Transformation or enrichment service
|
v
Business event topic
This extra step prevents every consumer from coupling directly to the physical structure of a legacy table.
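The transformation step might look like the following sketch. The shape of the raw change record (`table`, `op`, `before`, `after`) and the `CustomerAddressChanged` event are assumptions made for illustration; a real connector defines its own envelope.

```python
def to_business_event(change):
    """Translate a raw row-level change into a domain event, or None if it is not relevant."""
    if change["table"] != "customers" or change["op"] != "update":
        return None
    before, after = change["before"], change["after"]
    if before["address"] == after["address"]:
        return None  # some other column changed; no address event to publish
    # Copy only the fields consumers need. Other columns of the legacy row
    # (including sensitive ones) are never exposed on the business topic.
    return {
        "type": "CustomerAddressChanged",
        "customerId": after["id"],
        "newAddress": after["address"],
    }


raw_change = {
    "table": "customers",
    "op": "update",
    "before": {"id": 42, "address": "1 Main St", "tax_id": "secret"},
    "after": {"id": 42, "address": "9 Oak Ave", "tax_id": "secret"},
}
event = to_business_event(raw_change)
```

Consumers of the business topic see a named event with a stable contract, so a renamed or added column in the legacy table only affects this one function.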
Workload 3: Buffer High-Volume Application Logs
A distributed application produces logs from many services and instances. Sending each service directly to the final indexing system makes applications dependent on that system's availability and ingestion capacity.
Kafka can act as a durable buffer:
Applications
|
v
Log collectors
|
v
Dedicated Kafka log topics
|
v
Kafka Connect sink
|
v
Elasticsearch or another log index
|
v
Search and dashboards
Applications can write through standard logging frameworks while agents such as Fluentd collect and forward log files. This keeps Kafka-specific producer logic out of business code.
Kafka helps because it can absorb traffic bursts and allow the indexing system to catch up after an interruption. It also decouples the application fleet from a particular log-analysis product.
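The buffering effect can be illustrated with a small simulation. The numbers are invented: each tick a number of log records arrives, and the indexing system can drain at most a fixed number per tick.

```python
def simulate_backlog(arrivals, drain_per_tick):
    """Return the buffered backlog after each tick for a fixed-capacity consumer."""
    backlog = 0
    history = []
    for arrived in arrivals:
        backlog += arrived
        backlog -= min(backlog, drain_per_tick)
        history.append(backlog)
    return history


# A burst of 900 records in the third tick exceeds the indexer's capacity of 300.
history = simulate_backlog([100, 100, 900, 100, 100, 100, 100], drain_per_tick=300)
# history == [0, 0, 600, 400, 200, 0, 0]
```

The burst is absorbed as backlog and drained over the following ticks. Without a durable buffer, the same burst would have to be rejected, dropped, or held in application memory.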
Why logging may need a separate cluster
Log traffic has different priorities from business events.
Business events may prioritize durability and low latency because losing a payment or inventory update is unacceptable. Logging commonly prioritizes sustained throughput and large-volume buffering. The tolerance for losing a small amount of diagnostic data may also differ from the tolerance for losing financial data.
Placing both workloads on one cluster can allow a log spike to compete with critical event streams for disk, network, memory, and broker capacity.
A dedicated logging cluster provides:
- Resource isolation
- Independent capacity planning
- Retention policies suited to logs
- Tuning focused on throughput
- Reduced risk to business-event latency
For a small application with modest log volume, Kafka may be excessive. Direct collection agents or a simpler centralized logging path can be easier to operate.
Workload 4: Detect Fraud and Produce Real-Time Decisions
A fraud detector consumes transaction events, evaluates them, and publishes suspicious results.
Payment events
|
v
Kafka topic: payment-events
|
v
FraudDetectionService
|
| filters, enriches, evaluates
v
Kafka topic: suspicious-payments
|
v
Alerting and investigation systems
The fraud application is both a consumer and a producer. Kafka transports and stores the records, but it does not perform the fraud calculation inside the broker.
Kafka Streams, Apache Flink, or another processing framework can provide higher-level operations for filtering, transforming, aggregating, joining, and maintaining state over event streams.
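The consume, evaluate, and produce loop can be sketched as a plain generator. This is a stand-in for a stream-processing application, not Kafka Streams or Flink code; the rule, the `amount_factor` parameter, and the field names are hypothetical.

```python
def detect_suspicious(payment_events, typical_amount_by_customer, amount_factor=5.0):
    """Yield records for the suspicious-payments topic from a stream of payment events."""
    for event in payment_events:
        # Enrichment: join the event with per-customer history held by the processor.
        typical = typical_amount_by_customer.get(event["customerId"])
        if typical is None:
            reason = "no payment history"
        elif event["amount"] > amount_factor * typical:
            reason = "amount far above typical"
        else:
            continue  # filter: ordinary payments produce no output record
        yield {**event, "reason": reason}


payments = [
    {"paymentId": "p1", "customerId": "c1", "amount": 35.0},
    {"paymentId": "p2", "customerId": "c1", "amount": 500.0},
    {"paymentId": "p3", "customerId": "c9", "amount": 10.0},
]
suspicious = list(detect_suspicious(payments, {"c1": 40.0}))
```

All inspection, enrichment, and decision logic lives in this function. The log on either side only stores and delivers the records.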
Why Kafka fits
This workload benefits from:
- Low-delay reaction to new data
- Replay after a logic correction
- Independent downstream consumers
- Partition-based scaling
- Durable input and output topics
- A clear separation between transport and processing
What Kafka does not provide by itself
Kafka alone does not:
- Inspect transaction content
- Decide whether a transaction is fraudulent
- Join transaction data with customer history
- Maintain analytical state
- Route messages according to business fields
Those responsibilities belong in the processing layer.
Real-time processing also adds costs:
- Developers need streaming-specific knowledge.
- Stateful processing requires recovery planning.
- Distributed debugging is harder than debugging a local batch job.
- Tuning memory, disk, and parallelism takes testing.
- Tooling may be less mature than conventional application tooling.
- Not every language ecosystem has equal framework support.
Real-time processing is justified only when the decision loses value if delayed. A daily marketing report does not need the same architecture as fraud prevention.
Workload 5: Assign One Job to One Worker
The same platform also needs background tasks such as sending one email, resizing one image, or generating one document. Each task should be completed by one available worker.
This is naturally a point-to-point queue problem.
Kafka uses a publish-subscribe model. Consumer groups can divide partitions among workers, but the unit of assignment is the partition, not the individual message. Building exact worker assignment, broker-side routing, per-message acknowledgment, or priority handling on top of that model can require awkward application logic.
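The difference between partition-level and message-level assignment shows up as soon as there are more workers than partitions. The sketch below is a simplified model with hypothetical function names, not the actual group-rebalancing protocol.

```python
from collections import deque


def assign_partitions(num_partitions, workers):
    """Spread partitions over workers; each partition belongs to exactly one worker."""
    assignment = {worker: [] for worker in workers}
    for partition in range(num_partitions):
        assignment[workers[partition % len(workers)]].append(partition)
    return assignment


def dispatch_from_queue(tasks, workers):
    """Queue model: every free worker takes the next individual task."""
    queue = deque(tasks)
    taken = {}
    for worker in workers:
        if queue:
            taken[worker] = queue.popleft()
    return taken


workers = ["w1", "w2", "w3", "w4", "w5"]

assignment = assign_partitions(3, workers)
idle = [worker for worker, partitions in assignment.items() if not partitions]

taken = dispatch_from_queue(["email-1", "image-1", "doc-1", "email-2", "image-2"], workers)
```

With three partitions, `w4` and `w5` receive nothing no matter how many tasks are waiting. In the queue model all five workers take a task.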
RabbitMQ is often a better candidate when the core requirement is:
- One message should go to one worker.
- The broker should route messages using rules.
- Different queues represent different work destinations.
- Complex routing or task distribution matters more than replayable history.
RabbitMQ uses exchanges, bindings, and queues:
Producer
|
v
Exchange
|
+---- binding ----> Email queue ----> Email worker
|
+---- binding ----> Image queue ----> Image worker
|
+---- binding ----> Document queue -> Document worker
The broker can route according to configured rules. This is a different philosophy from Kafka's smart clients and simple broker data path.
Kafka can support worker-style consumption in some designs, but the team should not choose it merely because Kafka is already available. The operational convenience of one platform does not automatically compensate for a mismatch in messaging semantics.
Workload 6: Upload a Daily Data Warehouse Batch
The warehouse team wants every day's transactions collected and uploaded in one bulk operation.
Kafka can supply the source records, but it does not naturally define a closed daily batch. Events form a continuous stream, can arrive late, and are split across partitions.
A consumer building a daily batch must answer:
- How are records gathered from every partition?
- Which event time defines the day?
- When is the batch considered complete?
- How are late events handled?
- How are partial upload failures retried?
- How are only failed records selected when Kafka is read sequentially?
A conceptual implementation becomes more complicated than the requirement suggests:
Continuous transaction stream
|
v
Collect records across partitions
|
v
Apply time boundary
|
v
Wait for late events or close batch
|
v
Bulk upload
|
+---- partial failures require custom recovery
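A minimal sketch of the time-boundary and late-event steps, assuming each event carries an `event_time` in UTC and that the team accepts a fixed lateness allowance. The function name and the two-hour default are illustrative choices, not recommendations.

```python
from datetime import date, datetime, timedelta, timezone


def build_daily_batch(events, day, now, allowed_lateness=timedelta(hours=2)):
    """Return (batch, is_closed) for one UTC day, based on event time."""
    day_start = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
    day_end = day_start + timedelta(days=1)
    # Records from every partition are assumed to be gathered into `events`.
    batch = [e for e in events if day_start <= e["event_time"] < day_end]
    # The batch only closes once the lateness allowance has passed.
    is_closed = now >= day_end + allowed_lateness
    return batch, is_closed


def utc(*args):
    return datetime(*args, tzinfo=timezone.utc)


events = [
    {"id": "t1", "event_time": utc(2024, 3, 1, 10, 0)},
    {"id": "t2", "event_time": utc(2024, 3, 1, 23, 59)},
    {"id": "t3", "event_time": utc(2024, 3, 2, 0, 0, 1)},
]
batch, closed_early = build_daily_batch(events, date(2024, 3, 1), now=utc(2024, 3, 2, 1, 0))
_, closed_later = build_daily_batch(events, date(2024, 3, 1), now=utc(2024, 3, 2, 2, 0))
```

Even this small version leaves open what happens to an event for 1 March that arrives after the batch has closed, and how a partially failed upload is retried. Those questions do not arise when the source is a database query over a finished day.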
If the business genuinely needs a once-per-day transfer and does not require immediate processing, a conventional extract-transform-load process may be simpler and less expensive.
Kafka can still feed the warehouse, especially when the same events already support real-time consumers. The mistake is treating Kafka as proof that the warehouse load itself must become a streaming application.
Use batch tools when the workload is fundamentally batch-oriented. Use streaming when incremental results provide business value.
Workload 7: Send Large Documents Through Kafka
The retailer wants to transfer invoices, images, and generated reports between services. These files may be much larger than ordinary event records.
Kafka is optimized for streams of relatively small messages. Large records consume producer memory, consumer memory, network bandwidth, and replication capacity. Increasing size limits also requires coordinated changes across brokers, topics, producers, consumers, and replica fetching.
The preferred architecture stores the content elsewhere and sends a reference:
Producer
|
| store document
v
Object or file storage
|
| return document reference
v
Kafka event containing:
- document identifier
- storage reference
- type
- checksum or relevant metadata
|
v
Consumer retrieves document from storage
This pattern keeps Kafka records small and leaves large binary storage to a system designed for it.
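The reference pattern can be sketched as follows. A dictionary stands in for object storage, and the event field names (`documentId`, `storageRef`, `sha256`) are assumptions chosen to match the diagram above.

```python
import hashlib
import uuid

object_store = {}  # stand-in for an object or file storage service


def publish_document(content, doc_type):
    """Store the bytes externally and return the small event that would go to Kafka."""
    document_id = str(uuid.uuid4())
    storage_ref = f"documents/{document_id}"
    object_store[storage_ref] = content
    return {
        "documentId": document_id,
        "storageRef": storage_ref,
        "type": doc_type,
        "sha256": hashlib.sha256(content).hexdigest(),
        "sizeBytes": len(content),
    }


def fetch_document(event):
    """Consumer side: resolve the reference and verify the checksum."""
    content = object_store[event["storageRef"]]  # KeyError if the object was deleted
    if hashlib.sha256(content).hexdigest() != event["sha256"]:
        raise ValueError("checksum mismatch for " + event["documentId"])
    return content


invoice = b"%PDF-1.7 placeholder invoice bytes" * 1000
event = publish_document(invoice, "invoice")
restored = fetch_document(event)
```

The event stays a few hundred bytes regardless of document size. The `KeyError` path in `fetch_document` is where a cleanup job that runs ahead of Kafka retention would surface during a replay.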
The tradeoff is lifecycle coordination. If the event expires but the file remains, storage can accumulate orphaned objects. If the file is deleted too early, a consumer replay cannot retrieve it. Retention and cleanup must therefore be designed across both systems.
Splitting one large message into fragments is possible, but it creates more complexity. All fragments must remain together, ordering must be preserved, failed consumers must recover partially received data, and the application must reassemble the content correctly.
Use fragmentation only when the reference pattern is impossible and the added failure modes are acceptable.
Workload 8: Enforce One Global Order
The business plans a limited-stock sale and asks for every purchase request to be processed in one exact order across all regions.
Kafka guarantees ordering inside a partition. It does not provide one global order across several partitions.
A single-partition topic creates one sequence:
Partition 0:
request 1 -> request 2 -> request 3 -> request 4
But it also removes partition-level parallelism and limits scalability for that workflow.
Multiple producers introduce another issue. Kafka preserves the order in which records reach the partition, not the order in which they were created. Network delay can cause a request created earlier to arrive after a later request from another producer.
Producer A creates event A first
Producer B creates event B later
Network delay:
B reaches Kafka first
A reaches Kafka later
Kafka order:
B -> A
If business fairness depends on a globally authoritative sequence, the system needs a deliberate sequencing design. Possible approaches include:
- One serialized ingress path
- A single-partition topic with accepted throughput limits
- Explicit sequence metadata assigned by an authoritative component
- A downstream reordering stage based on defined event-time rules
Sequence numbers can help consumers reconstruct order, but they do not make Kafka globally order events at ingestion. The ordering authority must exist somewhere in the architecture.
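A minimal sketch of explicit sequencing with a downstream reordering stage. `Sequencer` plays the role of the authoritative component and is itself a single serialized point; both names are hypothetical.

```python
import itertools


class Sequencer:
    """Single authority that stamps each request with the next sequence number."""

    def __init__(self):
        self._counter = itertools.count(1)

    def stamp(self, request):
        return {**request, "seq": next(self._counter)}


def reorder(arrived):
    """Release records in sequence order, buffering any that arrive ahead of a gap."""
    pending = {}
    expected = 1
    for record in arrived:
        pending[record["seq"]] = record
        while expected in pending:
            yield pending.pop(expected)
            expected += 1


sequencer = Sequencer()
a = sequencer.stamp({"request": "A"})  # created first
b = sequencer.stamp({"request": "B"})  # created later

arrival_order = [b, a]                 # network delay: B reaches the log first
processed = [r["request"] for r in reorder(arrival_order)]
```

The log still stores `B` before `A`. The fair order exists only because the sequencer assigned it and the consumer honors it. A record that never arrives would block `reorder` indefinitely, so a real design also needs a rule for gaps.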
Workload 9: Query Events by Business Field
A support API needs to retrieve all activity for one customer, filter purchases by date, and search by email address.
Kafka consumers read partitions sequentially. They can seek using offsets or timestamps, but Kafka does not index arbitrary payload fields.
The following access pattern is a poor fit for Kafka as primary query storage:
Find all events where:
customerEmail = a specific value
purchaseAmount exceeds a threshold
supportCategory matches a phrase
A consumer would have to read and deserialize records until it found matching content.
The correct design projects events into an indexed datastore:
Kafka event stream
|
v
Projection consumer
|
v
Relational, document, or search database
|
v
Customer support API
Kafka remains the event backbone. The database provides random access, filtering, indexing, and query performance.
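The projection step can be sketched with an in-memory SQLite database standing in for the support datastore. The table layout and event fields are assumptions for illustration.

```python
import sqlite3


def build_projection(events):
    """Project a sequential event stream into an indexed table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE activity ("
        "customer_email TEXT, kind TEXT, amount REAL, occurred_on TEXT)"
    )
    db.execute("CREATE INDEX idx_activity_email ON activity (customer_email)")
    db.executemany(
        "INSERT INTO activity VALUES (?, ?, ?, ?)",
        [(e["email"], e["kind"], e["amount"], e["date"]) for e in events],
    )
    db.commit()
    return db


def purchases_over(db, email, threshold):
    """Answer a business query that the event log itself cannot index."""
    rows = db.execute(
        "SELECT occurred_on, amount FROM activity "
        "WHERE customer_email = ? AND kind = 'purchase' AND amount > ? "
        "ORDER BY occurred_on",
        (email, threshold),
    )
    return rows.fetchall()


events = [
    {"email": "ana@example.com", "kind": "purchase", "amount": 120.0, "date": "2024-03-01"},
    {"email": "bo@example.com", "kind": "purchase", "amount": 300.0, "date": "2024-03-02"},
    {"email": "ana@example.com", "kind": "support", "amount": 0.0, "date": "2024-03-03"},
    {"email": "ana@example.com", "kind": "purchase", "amount": 15.0, "date": "2024-03-04"},
]
db = build_projection(events)
result = purchases_over(db, "ana@example.com", 50.0)
```

The consumer reads the stream once, in order, and the database answers every later query through its index.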
Compare Kafka, RabbitMQ, Pulsar, and Managed Services
No comparison table can replace testing, but it can expose architectural mismatches early.
| Requirement | Kafka | RabbitMQ | Apache Pulsar | Managed event service |
|---|---|---|---|---|
| Replayable event history | Strong fit | Depends on queue or stream model | Strong fit | Product-dependent |
| High-rate publish-subscribe streams | Strong fit | Possible, but traditional queues have a different focus | Strong fit | Often strong |
| Complex broker-side routing | Limited | Strong fit through exchanges and bindings | Depends on design and features | Product-dependent |
| One-task-to-one-worker queue | Possible but not the natural model | Strong fit | Supported through subscription models | Product-dependent |
| Separate compute and storage scaling | Traditional brokers combine both | Not the primary model | Core architectural feature | Often abstracted by provider |
| Low operational responsibility | Requires self-management unless managed | Requires self-management unless managed | Requires self-management unless managed | Strong fit |
| Kafka client compatibility | Native | Different protocol and clients | Different clients and APIs | Some services support Kafka protocol |
Apache Pulsar is architecturally close to Kafka. It uses partitioned topics and publish-subscribe communication, but separates stateless brokers from persistent storage managed by Apache BookKeeper. This can make compute and storage scaling more independent.
Kafka brokers traditionally handle both client communication and partition storage. That difference affects recovery, resource management, and operational design.
Managed cloud services can reduce provisioning and cluster-maintenance work. The tradeoffs include less control, platform-specific behavior, provider pricing models and quotas, and possible ecosystem differences. Protocol compatibility does not guarantee identical semantics.
A Decision Workflow for New Use Cases
The platform team can use a small architectural decision process before creating a topic.
function choosePlatform(workload):
    if workload.requiresContentIndexedQueries:
        return "Database or search datastore"
    if workload.containsLargeBinaryPayloads:
        return "External object storage plus event reference"
    if workload.requiresOneMessageForOneWorker:
        return "Queue-oriented broker"
    if workload.requiresStrictGlobalOrder:
        return "Central sequencing design, possibly one partition"
    if workload.isScheduledBatchWithoutRealTimeValue:
        return "ETL or batch-processing tool"
    if workload.requiresReplayableFanOutOrContinuousStreaming:
        return "Evaluate Kafka or Pulsar"
    return "Compare the simplest viable options"
The result is not an automatic technology selection. It is a prompt to examine the requirement before reusing the existing platform.
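For teams that want to run the workflow against a list of proposed workloads, the same checks can be written as a small Python function. The `Workload` dataclass and its field names are an assumed translation of the pseudocode flags, and the checks keep the same order.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    requires_content_indexed_queries: bool = False
    contains_large_binary_payloads: bool = False
    requires_one_message_for_one_worker: bool = False
    requires_strict_global_order: bool = False
    is_scheduled_batch_without_real_time_value: bool = False
    requires_replayable_fan_out_or_streaming: bool = False


def choose_platform(workload):
    """Return a starting point for discussion, not a final technology decision."""
    if workload.requires_content_indexed_queries:
        return "Database or search datastore"
    if workload.contains_large_binary_payloads:
        return "External object storage plus event reference"
    if workload.requires_one_message_for_one_worker:
        return "Queue-oriented broker"
    if workload.requires_strict_global_order:
        return "Central sequencing design, possibly one partition"
    if workload.is_scheduled_batch_without_real_time_value:
        return "ETL or batch-processing tool"
    if workload.requires_replayable_fan_out_or_streaming:
        return "Evaluate Kafka or Pulsar"
    return "Compare the simplest viable options"


review = [
    Workload("customer profile changes", requires_replayable_fan_out_or_streaming=True),
    Workload("image resizing jobs", requires_one_message_for_one_worker=True),
    Workload("invoice transfer", contains_large_binary_payloads=True),
    Workload("daily warehouse export", is_scheduled_batch_without_real_time_value=True),
]
suggestions = {w.name: choose_platform(w) for w in review}
```

Because the checks run in order, a workload that sets several flags is matched by the first mismatch it hits. That ordering is a design choice of this sketch: the constraints Kafka handles worst are examined first.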
For workloads that remain Kafka candidates, continue with these questions:
- How many independent consumer groups need each event?
- Which entity defines the ordering key?
- What throughput and latency are required?
- How long must consumers be able to recover and replay?
- Is compaction required for current state?
- Where will searchable projections be stored?
- Does processing require joins, windows, or state?
- Are message sizes compatible with Kafka?
- Can the team operate brokers, connectors, and processing applications?
- Would a managed service reduce risk more than it reduces control?
The Revised Architecture
After the review, the platform uses different tools for different jobs:
Business microservices
|
v
Kafka business-event cluster
|
+----> Customer projection database
+----> Fraud stream processor
+----> Recommendation processor
+----> Analytical consumers

Legacy databases
|
v
Kafka Connect
|
v
Raw integration topics
|
v
Transformation services
|
v
Business-event topics

Application logs
|
v
Log collectors
|
v
Separate Kafka logging cluster
|
v
Search and dashboard platform

Background tasks
|
v
Queue-oriented broker
|
v
Worker pools

Large documents
|
v
Object storage
|
v
Kafka carries references only

Daily warehouse workload
|
v
Batch or ETL pipeline
The architecture contains more than one product, but each component has a clear responsibility. Kafka remains central without becoming a forced solution for every communication problem.
Common Kafka-First Mistakes
Choosing Kafka because it is already deployed
Existing expertise and infrastructure are valid factors, but they should not override a major semantic mismatch.
Treating every asynchronous operation as event streaming
A durable business event, a background command, a log line, and a daily batch record are all asynchronous data, but they do not require the same behavior.
Expecting brokers to route or validate business content
Kafka brokers do not deserialize payloads. Routing, validation, and transformation require clients or processing applications.
Using Kafka as a searchable database
Retention and replay do not provide indexes on business fields. Build a projection in an appropriate datastore.
Ignoring retention when consumers can be offline
Kafka can delete records after retention expires even when a consumer has not processed them. Recovery objectives must determine retention.
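The interaction between retention and an offline consumer can be sketched with a toy model in which a record's list index is its offset and timestamps are in hours. The function names are hypothetical, and real brokers delete whole log segments, so actual deletion is coarser than this.

```python
def log_start_after_retention(timestamps, now, retention):
    """Return the offset of the first record still inside the retention window."""
    for offset, written_at in enumerate(timestamps):
        if now - written_at <= retention:
            return offset
    return len(timestamps)


def records_lost(committed_offset, log_start_offset):
    """Count records that expired before the consumer could read them."""
    return max(0, log_start_offset - committed_offset)


# One record per hour for ten hours. The consumer committed offset 3 and went offline.
timestamps = list(range(10))
committed = 3

short_retention_start = log_start_after_retention(timestamps, now=10, retention=4)
long_retention_start = log_start_after_retention(timestamps, now=10, retention=8)

lost_with_short = records_lost(committed, short_retention_start)
lost_with_long = records_lost(committed, long_retention_start)
```

With four hours of retention the consumer returns to find offsets 3 through 5 gone. With eight hours nothing is lost. In a real consumer, a committed offset that no longer exists is handled according to the `auto.offset.reset` setting, which is why the outage budget should be decided before the retention value.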
Sending large files as ordinary events
Large payloads affect producers, brokers, replicas, networks, and consumers. Store the content externally and publish a reference when possible.
Assuming partitions provide global order
Partitions provide parallelism and local order. Global order requires sacrificing parallelism or introducing an explicit sequencing mechanism.
Adding real-time processing where batch results are sufficient
Streaming adds operational and development complexity. Use it when low-delay results create real value.
Mixing log traffic with critical business events without isolation
Log volume can compete with business workloads. Separate clusters may be appropriate because reliability, retention, throughput, and latency goals differ.
Architecture Review Checklist
Before approving Kafka for a workload, confirm:
- The data is naturally represented as an append-only event stream.
- Multiple consumers need independent access or replay.
- Partition-based parallelism matches the processing model.
- Per-key ordering is sufficient.
- The payloads are reasonably small.
- Sequential access is acceptable.
- Searchable state will be projected into another datastore.
- Retention supports the maximum consumer outage.
- Processing logic belongs outside the broker.
- Batch boundaries are not being forced onto a continuous stream.
- The workload does not require complex broker-side routing.
- Operational costs for Kafka, Connect, and stream processing are justified.
- Schema changes and sensitive data exposure have a governance plan.
- Alternatives have been compared using functional and nonfunctional requirements.
Conclusion
Kafka is an excellent foundation for replayable event distribution, high-rate data integration, log buffering, and real-time processing. Its value comes from a deliberate architecture: durable partitions, independent consumer groups, sequential reads, and simple brokers that leave business logic to clients.
Those same characteristics create clear boundaries. Kafka is not a content-indexed database, a large-file store, a naturally global queue, or a batch scheduler. It can be adapted to some of those jobs, but the surrounding code and operational complexity may erase the benefit of platform reuse.
A successful Kafka architecture does not maximize the number of Kafka topics. It identifies the workloads that genuinely benefit from event streaming, gives them clear contracts and recovery policies, and chooses simpler or more specialized tools for everything else.