A customer-data platform begins as a small Kafka proof of concept inside one network. Producers and consumers use unencrypted connections, every developer can access every topic, and brokers also manage cluster metadata. The setup works well enough to demonstrate real-time profile and transaction processing.
Production changes the risk model. Customer events may contain personal data. Analytics teams run in the cloud while operational systems remain on-premises. Security teams require authenticated identities, encrypted communication, restricted topic access, protected storage, and a documented failover design. The platform must also remain manageable when a controller or broker fails.
The dangerous mistake is to solve these concerns independently. A team may secure external clients but leave broker-to-broker or controller traffic unprotected. It may stretch one cluster across network boundaries without accounting for latency and outages. It may configure three controllers without realizing that the metadata quorum needs a majority, so only one of them can fail. It may enable Transport Layer Security, or TLS, and assume that records are encrypted on disk.
This tutorial designs a production Kafka architecture for a hybrid customer-data platform. The focus is preserving metadata availability, choosing the right deployment boundary, securing every participant, and testing the failure cases before real customer events enter the system.
The Enterprise Scenario
Assume the organization has these components:
- ProfileService publishes customer profile changes from an on-premises environment.
- TransactionService publishes customer transaction events.
- CustomerViewService builds an operational customer view.
- Cloud analytics services consume a selected subset of events.
- Kafka brokers store and distribute records.
- KRaft controllers manage cluster metadata.
- A replication component moves approved topics between environments.
- Enterprise identity, certificate, network, and storage systems provide security controls.
The target architecture is:
On-premises environment
ProfileService ---------+
TransactionService -----+----> Local Kafka cluster
CustomerViewService <---+              |
                                       | selected-topic replication
                                       v
Cloud environment              Cloud Kafka cluster
                                       |
                                       +----> Analytics
                                       +----> Machine learning
                                       +----> Business intelligence
The architecture has four separate concerns:
- Metadata availability: Controllers must maintain a consistent view of topics, partitions, brokers, access rules, and other cluster metadata.
- Data availability: Brokers must continue serving partition data during failures according to the replication policy.
- Deployment placement: Kafka must fit network, operational, cost, latency, and compliance constraints.
- Security: Every client, broker, controller, and supporting component must be authenticated, authorized, and protected in transit and at rest.
Why the Metadata Plane Must Be Designed Explicitly
Kafka brokers move records between producers and consumers, store partitions, and replicate data. KRaft controllers manage the metadata that tells the cluster how those brokers and partitions are organized.
The controller metadata includes information such as:
- Topic names and configuration
- Partition leaders and replica placement
- In-sync replica state
- Broker identifiers and connection details
- Controller roles
- Access control lists
- Client quotas
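- Producer ID allocations used by idempotent and transactional producers
Consumer group membership, partition assignments, and committed offsets are not part of this log. The brokers acting as group coordinators keep them in the internal __consumer_offsets topic.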
Clients do not read the metadata log directly. Brokers receive metadata from the controllers and provide the information clients need.
A simplified architecture is:
KRaft controller quorum
- metadata leader
- metadata followers
|
| metadata updates and broker heartbeats
v
Kafka broker cluster
|
+---- producers
+---- consumers
+---- Kafka Connect
+---- Schema Registry
The metadata log is replicated using the Raft consensus model. A change is committed only after a majority of controllers acknowledges it. This protects committed cluster metadata from being lost during leader replacement.
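The majority rule can be made concrete with a small sketch. This is an illustrative model, not Kafka's implementation: the function name `committed_offset` and the sample offsets are assumptions. It treats each controller's log end offset as an acknowledgment and returns the highest offset that a majority of controllers has reached.

```python
def committed_offset(log_end_offsets: list[int]) -> int:
    """Return the highest offset replicated on a majority of controllers.

    Illustrative model only: each entry is one controller's log end offset,
    including the leader's own.
    """
    if not log_end_offsets:
        raise ValueError("at least one controller is required")
    majority = len(log_end_offsets) // 2 + 1
    # Sort descending; the majority-th largest value is held by at least
    # `majority` controllers.
    ranked = sorted(log_end_offsets, reverse=True)
    return ranked[majority - 1]


# Leader at 120, one follower at 118, one lagging follower at 90.
print(committed_offset([120, 118, 90]))  # prints 118
```

In this example, entries up to 118 are safe because two of three controllers hold them, while entries 119 and 120 exist only on the leader and could be lost if it failed before replicating them.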
Build a Majority-Based KRaft Quorum
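A controller quorum should use an odd number of members. An even count adds a node without adding failure tolerance: four controllers need three for a majority and still tolerate only one failure, the same as three controllers.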
With three controllers:
Controllers: 3
Majority required: 2
Tolerated controller failures: 1
If one controller fails, the remaining two still form a majority. If two fail, the quorum cannot commit metadata changes.
This does not mean every broker stops immediately when the majority is lost. Existing data traffic may continue temporarily depending on the operation and current cluster state, but metadata changes and reliable control-plane operation require a functioning quorum.
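The arithmetic generalizes to any quorum size. The helper below is a minimal sketch with a hypothetical name, `quorum_math`; it only restates the majority rule and does not query a real cluster.

```python
def quorum_math(controllers: int) -> tuple[int, int]:
    """Return (majority required, tolerated failures) for a quorum size."""
    if controllers < 1:
        raise ValueError("a quorum needs at least one controller")
    majority = controllers // 2 + 1
    tolerated = controllers - majority
    return majority, tolerated


for size in (3, 4, 5):
    majority, tolerated = quorum_math(size)
    print(f"controllers={size} majority={majority} tolerated_failures={tolerated}")
```

Running it shows that four controllers tolerate the same single failure as three, which is why quorum sizes are normally odd.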
A three-controller static configuration can be represented like this:
node.id=21
process.roles=controller
controller.quorum.voters=21@meta-a:9193,22@meta-b:9193,23@meta-c:9193
controller.listener.names=CONTROL
listeners=CONTROL://meta-a:9193
listener.security.protocol.map=CONTROL:SASL_SSL
Each controller uses a unique node.id and its own listener address. The other controllers use equivalent configuration with their corresponding identifiers and hosts.
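Because every controller repeats the same voter list, a small pre-deployment check can catch duplicated identifiers or malformed entries before a node starts. The following is a sketch under assumptions: `parse_voters` and `check_node` are hypothetical helpers, not part of Kafka, and they only validate the `id@host:port` text format shown above.

```python
def parse_voters(value: str) -> dict[int, tuple[str, int]]:
    """Parse 'id@host:port,...' into {id: (host, port)} and reject duplicates."""
    voters: dict[int, tuple[str, int]] = {}
    for entry in value.split(","):
        node_id_text, _, endpoint = entry.strip().partition("@")
        host, _, port_text = endpoint.rpartition(":")
        if not node_id_text or not host or not port_text:
            raise ValueError(f"malformed voter entry: {entry!r}")
        node_id = int(node_id_text)
        if node_id in voters:
            raise ValueError(f"duplicate node.id in voters: {node_id}")
        voters[node_id] = (host, int(port_text))
    return voters


def check_node(node_id: int, voters_value: str) -> None:
    """Raise if this controller's node.id is missing from the voter list."""
    if node_id not in parse_voters(voters_value):
        raise ValueError(f"node.id {node_id} is not listed as a voter")


VOTERS = "21@meta-a:9193,22@meta-b:9193,23@meta-c:9193"
check_node(21, VOTERS)
print(parse_voters(VOTERS))
```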
The property meanings are:
| Property | Purpose |
|---|---|
| node.id | Uniquely identifies the server in the cluster |
| process.roles | Defines whether the node is a controller, broker, or both |
| controller.quorum.voters | Lists static controller quorum members |
| controller.listener.names | Identifies the listener used for controller communication |
| listeners | Defines where the process accepts connections |
| listener.security.protocol.map | Maps listener names to security protocols |
Dynamic controller membership, available in recent Kafka releases, replaces the static voter list with controller.quorum.bootstrap.servers and manages voters through the kafka-storage.sh and kafka-metadata-quorum.sh tools. The important design choice is whether controller membership is intentionally static or operationally managed as a dynamic quorum.
Separate Controllers and Brokers in Production
Kafka supports combined nodes with both roles:
process.roles=broker,controller
Combined mode can be useful for a small development environment, but dedicated roles give production systems clearer resource and failure boundaries.
A dedicated broker configuration may look like this:
node.id=31
process.roles=broker
controller.quorum.voters=21@meta-a:9193,22@meta-b:9193,23@meta-c:9193
controller.listener.names=CONTROL
listeners=INTERNAL://broker-a:9192,CLIENT://broker-a.example:9195
advertised.listeners=INTERNAL://broker-a:9192,CLIENT://broker-a.example:9195
inter.broker.listener.name=INTERNAL
listener.security.protocol.map=CONTROL:SASL_SSL,INTERNAL:SASL_SSL,CLIENT:SASL_SSL
This separation prevents heavy client and partition workloads from competing directly with controller duties on the same process. It also lets the team scale broker storage and throughput independently from the metadata quorum.
The controller quorum should remain small and stable. Broker capacity can grow as traffic and retained data increase.
Understand Controller and Broker Failover Separately
A controller failure and a broker failure affect different parts of Kafka.
Active controller failure
Follower controllers continuously fetch metadata updates from the active controller. When they stop receiving responses within the configured timeout, they start a new election.
A candidate requests votes and presents information about its metadata log. A controller grants a vote only according to the Raft election rules, including whether the candidate's log is sufficiently current.
The new active controller does not immediately process metadata changes. It first ensures that its local metadata contains all committed entries. Brokers then locate the new controller and resume control-plane communication.
Active controller fails
|
v
Followers detect missing responses
|
v
Election starts
|
v
Majority selects a current candidate
|
v
New leader verifies committed metadata
|
v
Brokers reconnect to the active controller
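The "sufficiently current" check can be sketched with the standard Raft comparison: a voter prefers the log with the higher last epoch and, when epochs are equal, the one with at least as many entries. The code below is a simplified model, not KRaft's implementation. The names `LogPosition`, `candidate_log_is_current`, and `wins_election` are hypothetical, and it ignores details such as whether a voter has already voted in the current epoch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogPosition:
    last_epoch: int   # epoch of the last entry in the local metadata log
    end_offset: int   # offset just past the last entry


def candidate_log_is_current(candidate: LogPosition, voter: LogPosition) -> bool:
    """Raft-style check: the candidate's log must be at least as up to date."""
    if candidate.last_epoch != voter.last_epoch:
        return candidate.last_epoch > voter.last_epoch
    return candidate.end_offset >= voter.end_offset


def wins_election(candidate: LogPosition, other_voters: list[LogPosition]) -> bool:
    """The candidate votes for itself and needs a majority of the whole quorum."""
    votes = 1 + sum(candidate_log_is_current(candidate, v) for v in other_voters)
    quorum_size = len(other_voters) + 1
    return votes >= quorum_size // 2 + 1


candidate = LogPosition(last_epoch=5, end_offset=100)
others = [LogPosition(5, 100), LogPosition(5, 120)]
print(wins_election(candidate, others))  # prints True: 2 of 3 votes
```

The second voter refuses because its log is longer, yet the candidate still wins with two of three votes. Any entry that a majority held is present on at least one of those two voters, which is how committed metadata survives the leader change.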
Follower controller failure
The active controller can continue operating while a majority remains. The cluster has less failure tolerance until the follower returns or is replaced.
Broker failure
Brokers send heartbeats to the controller. When a broker is considered unavailable, partition leadership hosted on that broker may be reassigned to eligible replicas on surviving brokers.
The metadata quorum coordinates that change, while the broker replication design determines whether partition data remains available.
This distinction matters during testing. A controller failover test validates metadata leadership. A broker failover test validates partition leadership and data availability. Passing one does not prove the other.
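For the broker side, the selection of a replacement partition leader can be modeled as picking the first assigned replica that is both in sync and alive. This is a hedged sketch of that idea with a hypothetical function name, `elect_leader`; it leaves out unclean leader election and other controller behavior.

```python
from typing import Optional


def elect_leader(replicas: list[int], isr: set[int], live_brokers: set[int]) -> Optional[int]:
    """Return the first replica, in assignment order, that is in sync and alive."""
    for broker_id in replicas:
        if broker_id in isr and broker_id in live_brokers:
            return broker_id
    return None  # no eligible replica: the partition stays offline


# Broker 31 led the partition and has just failed.
print(elect_leader(replicas=[31, 32, 33], isr={31, 32}, live_brokers={32, 33}))  # prints 32

# Only the failed broker was in sync, so no clean leader exists.
print(elect_leader(replicas=[31, 32, 33], isr={31}, live_brokers={32, 33}))  # prints None
```

The second case is the reason a broker failover test has to check the in-sync replica set and not only the number of surviving brokers.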
Avoid Carrying ZooKeeper Assumptions into KRaft
Older Kafka deployments use a separate ZooKeeper cluster to store metadata and monitor broker liveness. One Kafka broker also acts as the active controller and coordinates metadata changes with ZooKeeper.
KRaft replaces that split design with Kafka's own controller quorum and metadata log.
Older architecture:
ZooKeeper quorum + Kafka brokers + active controller broker
KRaft architecture:
KRaft controller quorum + Kafka brokers
The operating model is different enough that teams should not copy ZooKeeper-era failover procedures, monitoring assumptions, or configuration directly into a KRaft deployment.
For an existing ZooKeeper-based platform, migration should be treated as an infrastructure project with explicit compatibility, rollback, testing, and operational-readiness work. It should not be combined casually with unrelated broker upgrades or security redesign.
Choose the Deployment Model from Constraints
The organization has three realistic choices:
- Self-managed Kafka on-premises or in a private environment
- Managed Kafka in the cloud
- A hybrid architecture
The correct choice depends on more than where servers are available.
Self-managed on-premises Kafka
This model gives the organization control over:
- Kafka versions
- Broker and controller settings
- Network placement
- Security integrations
- Storage layout
- Monitoring tools
- Upgrade timing
- Data location
The cost is operational responsibility. The team must manage provisioning, authentication, authorization, TLS, monitoring, scaling, maintenance, backups, and disaster recovery.
This model fits organizations with strict data-locality requirements or experienced Kafka and infrastructure teams.
Managed cloud Kafka
A managed service can remove much of the hardware and cluster administration. Provisioning, patching, and baseline availability are handled by the provider.
The tradeoffs include:
- Less access to low-level broker configuration
- Provider-controlled upgrade schedules
- Product-specific authentication and monitoring integrations
- Potentially higher cost at scale
- Data-transfer charges
- Migration difficulty when proprietary capabilities are adopted
- Data-location and compliance questions
A managed service is attractive when the organization values reduced operational load more than detailed control.
Hybrid Kafka
A hybrid system keeps operational data processing close to on-premises producers while providing selected data to cloud services.
The strongest general pattern uses two clusters:
On-premises Kafka
|
| replication
v
Cloud Kafka
A replication tool consumes from the source cluster and produces to the target cluster. Apache Kafka's MirrorMaker 2, Confluent Replicator, or another suitable replication solution can support this pattern.
Separate clusters prevent the cloud analytics environment from becoming part of the local cluster's synchronous broker or controller path. Network interruption then creates replication lag instead of directly stopping local production.
Why One Cross-Environment Cluster Is Risky
A single cloud Kafka cluster can serve both on-premises and cloud clients:
On-premises applications
|
| wide-area network
v
Cloud Kafka cluster
|
v
Cloud analytics
This architecture is simpler because it avoids a second cluster and replication layer. It also makes on-premises operations depend on:
- Stable wide-area connectivity
- Sufficient network bandwidth
- Predictable latency
- Cloud availability
- Data-transfer pricing
- Firewall and routing correctness
When the network fails, local producers and consumers can lose access to their event backbone.
The architecture is appropriate when the organization is intentionally centralizing in the cloud and the remaining on-premises dependencies are limited. It is a poor fit when local operational workflows must continue during a cloud or network outage.
Place Replication with Network Behavior in Mind
A replication component behaves as both a consumer and a producer:
Source Kafka
|
| consumer fetch across network
v
Replication process
|
| producer write
v
Target Kafka
Replication tools are often placed near the target cluster so producer writes and acknowledgments remain local and predictable. Source fetches cross the wide-area network and may accumulate lag.
The exact placement depends on:
- Network topology
- Security boundaries
- Firewall rules
- Throughput
- Latency
- Cost
- Operational ownership
Replication lag must be treated as a first-class metric. Cloud analytics cannot be assumed to contain the same data at the exact moment as the on-premises cluster.
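One simple way to express lag is the per-partition difference between the source log end offset and the last offset the replicator has copied. The sketch below assumes both sets of offsets have already been collected by whatever monitoring the team uses; `replication_lag` and the sample numbers are hypothetical, and a partition with no replicated offset is treated as starting from zero.

```python
def replication_lag(source_end_offsets: dict[int, int],
                    replicated_offsets: dict[int, int]) -> dict[int, int]:
    """Return records not yet replicated, keyed by partition number."""
    lag = {}
    for partition, end_offset in source_end_offsets.items():
        # Assumption: a partition the replicator has not reported starts at 0.
        replicated = replicated_offsets.get(partition, 0)
        lag[partition] = max(end_offset - replicated, 0)
    return lag


lag = replication_lag({0: 1500, 1: 900}, {0: 1500, 1: 640})
print(lag, "total:", sum(lag.values()))
```

Alerting on both the total and the worst single partition helps distinguish a general slowdown from one stuck partition.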
The organization should also transfer only approved topics and fields. Sensitive data can be omitted, anonymized, or transformed before it leaves the controlled environment when business and compliance rules require that separation.
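The approval rule can be sketched as a transformation applied to each event before it is written to the cloud cluster. Everything named here is an assumption for illustration: the topic name, the field names, the `prepare_for_cloud` function, and the choice of a keyed hash (HMAC-SHA-256) as the pseudonymization step. A real deployment would take these from its own data-classification policy and might apply them in a replication transform or stream processor.

```python
import hashlib
import hmac
from typing import Optional

# Hypothetical policy: topic -> fields allowed to leave the local environment.
APPROVED_FIELDS = {
    "CUSTOMER-PROFILE-CHANGES": {"customer_id", "segment", "updated_at"},
}
# Fields replaced by a keyed hash instead of being sent in clear text.
PSEUDONYMIZED_FIELDS = {"customer_id"}


def prepare_for_cloud(topic: str, event: dict, key: bytes) -> Optional[dict]:
    """Return the outbound event, or None when the topic is not approved."""
    allowed = APPROVED_FIELDS.get(topic)
    if allowed is None:
        return None
    outbound = {}
    for name, value in event.items():
        if name not in allowed:
            continue  # drop fields that are not approved for transfer
        if name in PSEUDONYMIZED_FIELDS:
            value = hmac.new(key, str(value).encode("utf-8"), hashlib.sha256).hexdigest()
        outbound[name] = value
    return outbound


event = {"customer_id": "C-1001", "email": "a@example.com", "segment": "gold"}
print(prepare_for_cloud("CUSTOMER-PROFILE-CHANGES", event, key=b"demo-key-only"))
print(prepare_for_cloud("CUSTOMER-PAYMENT-DETAILS", event, key=b"demo-key-only"))
```

The keyed hash keeps the identifier stable across events, so cloud analytics can still join records, while the original value stays on-premises with the key.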
Secure Every Kafka Communication Path
Kafka security has three main layers:
- Encryption: Protects data from being read while it travels over the network.
- Authentication: Verifies the identity of a client, broker, or controller.
- Authorization: Decides what an authenticated identity may do.
The communication paths include:
- Producer to broker
- Consumer to broker
- Kafka Connect to broker
- Schema Registry to broker
- Broker to broker
- Broker to controller
- Controller to controller
Securing only the external client listener leaves internal cluster traffic exposed.
A listener plan can assign different names and policies:
CLIENT listener:
application producers and consumers
INTERNAL listener:
broker replication
CONTROL listener:
controller quorum and broker-controller communication
Each listener should have an explicit protocol and a documented network boundary.
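A listener plan is easy to check mechanically before a configuration is deployed. The sketch below is an assumption-laden helper, not a Kafka tool: `find_listener_problems` reads the three properties used earlier in this tutorial and reports listeners that have no protocol mapping or that use an unencrypted protocol. It is deliberately stricter than Kafka, which can infer a protocol when a listener is named after one.

```python
ENCRYPTED_PROTOCOLS = {"SSL", "SASL_SSL"}


def parse_protocol_map(value: str) -> dict[str, str]:
    """Parse 'NAME:PROTOCOL,...' into a dictionary."""
    pairs = (item.split(":", 1) for item in value.split(",") if item.strip())
    return {name.strip(): protocol.strip() for name, protocol in pairs}


def listener_names(value: str) -> list[str]:
    """Extract names from 'NAME://host:port,...' or from a plain 'NAME,...' list."""
    return [item.split("://", 1)[0].strip() for item in value.split(",") if item.strip()]


def find_listener_problems(props: dict[str, str]) -> list[str]:
    protocol_map = parse_protocol_map(props.get("listener.security.protocol.map", ""))
    names = listener_names(props.get("listeners", ""))
    names += listener_names(props.get("controller.listener.names", ""))
    problems = []
    for name in dict.fromkeys(names):  # de-duplicate while keeping order
        protocol = protocol_map.get(name)
        if protocol is None:
            problems.append(f"{name}: no entry in listener.security.protocol.map")
        elif protocol not in ENCRYPTED_PROTOCOLS:
            problems.append(f"{name}: uses unencrypted protocol {protocol}")
    return problems


broker = {
    "listeners": "INTERNAL://broker-a:9192,CLIENT://broker-a.example:9195",
    "controller.listener.names": "CONTROL",
    "listener.security.protocol.map": "CONTROL:SASL_SSL,INTERNAL:SASL_SSL,CLIENT:PLAINTEXT",
}
print(find_listener_problems(broker))
```

With the CLIENT listener deliberately mapped to PLAINTEXT in this example, the check reports one problem; the broker configuration shown earlier in the tutorial would return an empty list.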
Encrypt Traffic with TLS
Transport Layer Security protects network communication and verifies the certificate presented by the remote party.
A Kafka participant usually needs:
- A keystore containing its private key and certificate
- A truststore containing trusted certificate authorities
- Passwords required to access the key material
The certificate workflow is:
Generate key pair
|
v
Create certificate signing request
|
v
Certificate authority signs request
|
v
Import signed certificate
|
v
Configure keystore and truststore
A broker-side TLS configuration can use properties such as:
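# Register the file-based config provider so the ${file:...} references resolve.
config.providers=file
config.providers.file.class=org.apache.kafka.common.config.provider.FileConfigProvider
ssl.keystore.location=/secure/kafka/broker.keystore.jks
ssl.keystore.password=${file:/secure/kafka/secrets:keystore.password}
ssl.truststore.location=/secure/kafka/broker.truststore.jks
ssl.truststore.password=${file:/secure/kafka/secrets:truststore.password}
ssl.key.password=${file:/secure/kafka/secrets:key.password}
# Require clients to present a trusted certificate (mutual TLS).
ssl.client.auth=required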
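When ssl.client.auth=required, clients must present trusted certificates. This enables mutual TLS, or mTLS, so both sides authenticate with certificates. On listeners that use SASL_SSL, the broker-wide setting is not applied; it must be set with the listener prefix, for example listener.name.client.ssl.client.auth=required, and SASL still supplies the authenticated principal.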
A client then requires corresponding truststore settings and, for mTLS, its own keystore and certificate.
Do not embed real passwords directly in general configuration files. Supported alternatives include:
- Environment variables
- Restricted external files
- Secret-management vault integrations
Certificate rotation must also be designed before certificates approach expiration.
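A rotation policy usually reduces to comparing each certificate's expiry date against a renewal window. The function below is a minimal sketch of that comparison. The name `rotation_status`, the 30-day default window, and the sample dates are assumptions; reading the expiry date out of a keystore is left to the organization's certificate tooling.

```python
from datetime import datetime, timedelta, timezone


def rotation_status(not_after: datetime, now: datetime,
                    renew_before: timedelta = timedelta(days=30)) -> str:
    """Classify a certificate as 'expired', 'rotate', or 'ok'."""
    if now >= not_after:
        return "expired"
    if not_after - now <= renew_before:
        return "rotate"
    return "ok"


now = datetime(2025, 6, 1, tzinfo=timezone.utc)
print(rotation_status(datetime(2025, 6, 20, tzinfo=timezone.utc), now))  # prints rotate
print(rotation_status(datetime(2026, 1, 1, tzinfo=timezone.utc), now))   # prints ok
```

Running a check like this on a schedule for every broker, controller, and client certificate turns expiration from an outage into a routine ticket.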
Choose the Authentication Mechanism Deliberately
Kafka supports mTLS and several Simple Authentication and Security Layer, or SASL, mechanisms.
| Mechanism | Typical use |
|---|---|
| mTLS | Certificate-based service identity |
| SASL/PLAIN | Username and password, protected by TLS |
| SASL/SCRAM | Password authentication with salted challenge-response hashing |
| SASL/GSSAPI | Kerberos-based enterprise identity |
| SASL/OAUTHBEARER | Token-based authentication through OAuth 2.0 |
| Provider-specific mechanism | Authentication integrated with a cloud platform |
SASL authentication should normally be combined with TLS:
security.protocol=SASL_SSL
sasl.mechanism=SCRAM-SHA-512
Using SASL over an unencrypted connection authenticates the participant but does not protect the traffic from interception.
Organizations centered on Security Assertion Markup Language, or SAML, typically integrate indirectly. An identity provider can authenticate the user through SAML and issue an OAuth token that Kafka accepts through SASL/OAUTHBEARER. A proxy or custom SASL implementation is another possibility, but it adds complexity.
Authentication must be applied consistently to application clients, brokers, controllers, Kafka Connect, Schema Registry, and administrative tools.
Enforce Least Privilege with ACLs
Authentication tells Kafka who is connecting. Access control lists, or ACLs, determine what that principal may do.
An ACL includes:
- Principal
- Resource type
- Resource name
- Literal or prefix matching
- Operation
- Allow or deny decision
- Host restriction
A producer identity might receive write and create access to a topic prefix:
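# admin-client.properties is a placeholder for a client configuration file
# that holds the administrator's SASL_SSL connection settings.
kafka-acls.sh \
  --bootstrap-server secured-broker:9195 \
  --command-config admin-client.properties \
  --add \
  --allow-principal User:profile-publisher \
  --operation Write \
  --operation Create \
  --resource-pattern-type prefixed \
  --topic CUSTOMER-PROFILE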
A consumer requires more than a topic read permission. It also needs the permissions required to operate within its consumer group, access metadata, and commit offsets through Kafka's internal mechanisms.
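The consumer's permission set can be generated rather than typed by hand, which keeps the topic and group grants together. The sketch below only builds `kafka-acls.sh` argument lists and prints them; it does not run anything. The function name, the `CUSTOMER-VIEW` group prefix, and the `admin-client.properties` file are assumptions for illustration.

```python
import shlex


def consumer_acl_commands(principal: str, topic_prefix: str, group_prefix: str,
                          bootstrap: str, command_config: str) -> list[list[str]]:
    """Build kafka-acls.sh commands granting a consumer topic and group access."""
    base = [
        "kafka-acls.sh",
        "--bootstrap-server", bootstrap,
        "--command-config", command_config,
        "--add",
        "--allow-principal", f"User:{principal}",
        "--resource-pattern-type", "prefixed",
    ]
    # Read and Describe on the topics, Read on the consumer group.
    topic_cmd = base + ["--operation", "Read", "--operation", "Describe",
                        "--topic", topic_prefix]
    group_cmd = base + ["--operation", "Read", "--group", group_prefix]
    return [topic_cmd, group_cmd]


for command in consumer_acl_commands(
    principal="customer-view-reader",
    topic_prefix="CUSTOMER-PROFILE",
    group_prefix="CUSTOMER-VIEW",
    bootstrap="secured-broker:9195",
    command_config="admin-client.properties",
):
    print(shlex.join(command))
```

Granting the group permission by prefix lets the application choose its own group names within an agreed namespace without a new ACL for each deployment.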
Transactional producers and idempotent producers can require additional permissions, such as Write and Describe on the TransactionalId resource for transactions. Do not assume that the basic --producer convenience option of kafka-acls.sh grants every permission required for exactly-once workflows.
A practical least-privilege model defines identities by application role:
profile-publisher:
write CUSTOMER-PROFILE topics
customer-view-reader:
read CUSTOMER-PROFILE topics
use CUSTOMER-VIEW consumer groups
cloud-replicator:
read approved local topics
write approved cloud topics
platform-operator:
controlled administrative permissions
Topic-level authorization also affects event design. If sensitive and non-sensitive fields share one topic, every principal allowed to consume that topic can receive the full payload. Kafka ACLs do not provide field-level authorization.
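The evaluation logic behind this model can be illustrated with a small, self-contained evaluator. It is a simplified sketch and not Kafka's authorizer: the `Acl` class and `is_authorized` function are hypothetical, and the model ignores host restrictions, wildcards, super users, and operations implied by other operations. It does keep two behaviors that matter for least privilege: a matching deny overrides any allow, and the absence of a matching allow means the request is refused.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Acl:
    principal: str
    resource_type: str   # "topic" or "group"
    name: str
    pattern: str         # "literal" or "prefixed"
    operation: str       # "read" or "write"
    permission: str      # "allow" or "deny"


def _matches(acl: Acl, principal: str, resource_type: str, resource: str, operation: str) -> bool:
    if (acl.principal, acl.resource_type, acl.operation) != (principal, resource_type, operation):
        return False
    if acl.pattern == "prefixed":
        return resource.startswith(acl.name)
    return resource == acl.name


def is_authorized(acls: list[Acl], principal: str, resource_type: str,
                  resource: str, operation: str) -> bool:
    relevant = [a for a in acls if _matches(a, principal, resource_type, resource, operation)]
    if any(a.permission == "deny" for a in relevant):
        return False  # deny always wins
    return any(a.permission == "allow" for a in relevant)  # default is deny


ACLS = [
    Acl("profile-publisher", "topic", "CUSTOMER-PROFILE", "prefixed", "write", "allow"),
    Acl("customer-view-reader", "topic", "CUSTOMER-PROFILE", "prefixed", "read", "allow"),
    Acl("customer-view-reader", "group", "CUSTOMER-VIEW", "prefixed", "read", "allow"),
]
print(is_authorized(ACLS, "profile-publisher", "topic", "CUSTOMER-PROFILE-CHANGES", "write"))
print(is_authorized(ACLS, "customer-view-reader", "topic", "CUSTOMER-PROFILE-CHANGES", "write"))
```

Note that nothing in this model looks inside the record. A principal that passes the topic check receives every field, which is the limitation described above.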
Protect Data at Rest
TLS protects records while they travel over the network. It does not automatically encrypt the broker's stored log segments.
The normal flow is:
Encrypted network traffic
|
v
Broker decrypts record
|
v
Broker stores record on disk
|
v
Broker encrypts outgoing network traffic
Kafka does not provide built-in record encryption at rest. The organization must use another control.
Disk or filesystem encryption
Operating-system or infrastructure-level encryption protects Kafka data files on disk. Linux Unified Key Setup, or LUKS, is one on-premises option. Cloud providers offer managed disk-encryption capabilities.
This approach is transparent to Kafka clients and preserves normal broker processing.
End-to-end payload encryption
The producer encrypts selected fields or the entire payload before sending it. The consumer decrypts it after reading.
Producer encrypts payload
|
v
Kafka stores encrypted bytes
|
v
Authorized consumer decrypts payload
This can protect data even from broker-disk access, but it complicates:
- Key distribution and rotation
- Schema handling
- Stream processing
- Filtering and transformation
- Operational debugging
- Consumer authorization
If brokers or processing applications need to inspect the payload, full end-to-end encryption may be incompatible with that workflow.
The security design should distinguish network encryption, disk encryption, and payload encryption instead of treating them as one feature.
A Controlled Production Rollout
A production rollout should prove the architecture in stages.
Phase 1: Build the control plane
- Deploy the KRaft controller quorum.
- Verify unique node identities and listener reachability.
- Confirm the quorum has an active leader and synchronized followers.
- Test one controller failure.
- Test restoration of the failed controller.
Phase 2: Add brokers
- Deploy dedicated brokers.
- Verify broker-controller communication.
- Confirm partition placement and replication.
- Stop one broker.
- Verify partition leadership moves to eligible replicas.
- Confirm producers and consumers recover as designed.
Phase 3: Enable transport security
- Create certificates and trust relationships.
- Enable TLS on controller, broker, and client paths.
- Enable mTLS or SASL authentication.
- Test expired, unknown, and revoked credentials.
- Verify that plaintext listeners are removed or isolated according to policy.
Phase 4: Apply authorization
- Create one principal per application role.
- Grant only required topic, group, cluster, and transaction permissions.
- Test allowed operations.
- Test denied topic and consumer-group access.
- Verify administrative permissions are separate from application permissions.
Phase 5: Protect storage
- Enable disk or filesystem encryption.
- Define key ownership and rotation.
- Decide whether selected payload fields need end-to-end encryption.
- Verify that backup and recovery processes preserve protection.
Phase 6: Introduce hybrid replication
- Create the cloud cluster.
- Replicate only approved topics.
- Measure normal replication lag.
- Interrupt the network path.
- Verify local production continues.
- Restore connectivity and measure catch-up.
- Confirm that cloud consumers handle delayed arrival correctly.
Test Failure and Security Boundaries
Test metadata quorum loss
- Stop one controller in a three-controller quorum.
- Verify that metadata operations continue.
- Stop a second controller in a controlled test environment.
- Confirm that the quorum can no longer commit metadata changes.
- Restore controllers and verify consistent recovery.
Test active controller replacement
- Identify the active controller.
- Stop it.
- Observe a follower election.
- Verify that brokers reconnect to the new leader.
- Confirm that committed metadata remains intact.
Test broker failure
- Produce data continuously.
- Stop a broker hosting partition leaders.
- Verify leader reassignment.
- Confirm acknowledged records remain readable.
- Check client recovery and error rates.
Test unauthorized access
- Connect with a valid identity that lacks topic permission.
- Attempt a prohibited produce or consume operation.
- Confirm that Kafka denies the request.
- Verify the denial is visible in security monitoring.
Test certificate failure
- Connect with an untrusted certificate.
- Connect with an expired certificate.
- Confirm that TLS negotiation fails.
- Verify alerting and diagnostic information.
Test hybrid network interruption
- Stop connectivity between local and cloud environments.
- Verify local producers and consumers continue.
- Measure replication backlog.
- Restore the connection.
- Confirm replication catches up without bypassing security controls.
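Before running the interruption test, it helps to estimate what backlog and catch-up time to expect, so that the measured values can be judged against something. The arithmetic below is a rough sketch under stated assumptions: constant produce and replication rates, no retention-based data loss during the outage, and hypothetical function names and sample numbers.

```python
def outage_backlog(produce_rate: float, outage_seconds: float) -> float:
    """Records accumulated on the source while replication is interrupted."""
    return produce_rate * outage_seconds


def catch_up_seconds(backlog: float, produce_rate: float, replication_rate: float) -> float:
    """Time to drain the backlog while new records keep arriving."""
    headroom = replication_rate - produce_rate
    if headroom <= 0:
        return float("inf")  # the replicator can never catch up
    return backlog / headroom


backlog = outage_backlog(produce_rate=2000, outage_seconds=600)
print(backlog)                                                             # prints 1200000
print(catch_up_seconds(backlog, produce_rate=2000, replication_rate=5000))  # prints 400.0
```

If the estimated catch-up time is unacceptable, the options are more replication throughput or a smaller replicated data set. The estimate also shows whether topic retention on the source is long enough to cover the longest outage the design is expected to survive.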
Common Mistakes
Running production controllers and brokers together without evaluating contention
Combined mode simplifies infrastructure but mixes control-plane and data-plane workloads.
Treating three controllers as tolerance for two failures
A three-member quorum requires two members for a majority. It tolerates one controller failure.
Stretching local operational traffic across an unreliable cloud connection
A single remote cluster can turn a network outage into a local application outage.
Choosing a managed service only to avoid setup work
Managed services reduce operations but also affect control, cost, portability, and compliance.
Encrypting only producer and consumer connections
Broker replication and controller communication also carry sensitive operational data and credentials.
Assuming TLS encrypts Kafka log files
TLS protects data in motion. Stored segments require disk, filesystem, cloud, or application-level encryption.
Reusing one broad principal for many applications
Shared identities make least privilege, auditing, revocation, and incident response harder.
Granting only topic read permission to consumers
Consumers also need the appropriate group and metadata-related permissions.
Putting sensitive and public fields in the same topic
Kafka authorization is usually resource-based, not field-based. Topic access exposes the complete record.
Hardcoding keystore and truststore passwords
Use environment substitution, restricted files, or a secret-management integration.
Waiting until production to enable security
Late security changes can alter listeners, identities, deployment automation, testing, and network rules. Enable representative controls early.
Production Checklist
Before approving the enterprise Kafka platform, confirm:
- Controllers and brokers have intentionally selected roles.
- Every node has a unique identifier.
- The controller quorum uses a majority-based design.
- Controller failover has been tested.
- Broker failover has been tested separately.
- ZooKeeper-era procedures are not assumed to apply to KRaft.
- The deployment model follows latency, cost, compliance, and operational requirements.
- Hybrid replication is isolated from local synchronous operation.
- Replication lag is monitored.
- Every listener has a documented protocol and network boundary.
- Client, broker, controller, Connect, and registry communication is secured.
- Authentication mechanisms match enterprise identity requirements.
- ACLs follow least privilege for topics, groups, cluster operations, and transactions.
- Sensitive data is separated according to topic-level authorization limits.
- Passwords and private keys are stored securely.
- Certificate issuance and rotation are operationalized.
- Disk or filesystem encryption protects broker storage.
- End-to-end payload encryption is used only where its processing tradeoffs are acceptable.
- Failure, unauthorized-access, certificate, and network-interruption tests have passed.
Conclusion
Moving Kafka into an enterprise environment requires designing the control plane, deployment boundary, and security model as one system.
KRaft controllers protect cluster metadata through a majority quorum, while brokers handle partition storage and client traffic. Dedicated roles make those responsibilities easier to scale and troubleshoot. In a hybrid organization, separate local and cloud clusters can protect operational workloads from wide-area network failures while replication supplies cloud analytics asynchronously.
Security must cover every communication path, not only external clients. TLS protects data in motion, authentication establishes identity, ACLs enforce resource access, and disk or payload encryption protects stored records. None of these controls replaces the others.
When metadata failover, network placement, identity, authorization, secrets, and storage protection are tested together, Kafka can move from a permissive proof of concept to a production platform without turning one controller failure, cloud interruption, or stolen credential into an enterprise-wide incident.