Distributed Systems, Security
June 20, 2026

Designing a Hybrid Kafka Platform Without Losing Metadata Quorum or Exposing Customer Events

A customer-data platform begins as a small Kafka proof of concept inside one network. Producers and consumers use unencrypted connections, every developer can access every topic, and brokers also manage cluster metadata. The setup works well enough to demonstrate real-time profile and transaction processing.

Production changes the risk model. Customer events may contain personal data. Analytics teams run in the cloud while operational systems remain on-premises. Security teams require authenticated identities, encrypted communication, restricted topic access, protected storage, and a documented failover design. The platform must also remain manageable when a controller or broker fails.

The dangerous mistake is to solve these concerns independently. A team may secure external clients but leave broker-to-broker or controller traffic unprotected. It may stretch one cluster across network boundaries without accounting for latency and outages. It may deploy three controllers without recognizing that the metadata quorum needs a majority of two, so only one controller failure is tolerated. It may enable Transport Layer Security, or TLS, and assume that records are encrypted on disk.

This tutorial designs a production Kafka architecture for a hybrid customer-data platform. The focus is preserving metadata availability, choosing the right deployment boundary, securing every participant, and testing the failure cases before real customer events enter the system.

The Enterprise Scenario

Assume the organization has these components:

  • ProfileService publishes customer profile changes from an on-premises environment.
  • TransactionService publishes customer transaction events.
  • CustomerViewService builds an operational customer view.
  • Cloud analytics services consume a selected subset of events.
  • Kafka brokers store and distribute records.
  • KRaft controllers manage cluster metadata.
  • A replication component moves approved topics between environments.
  • Enterprise identity, certificate, network, and storage systems provide security controls.

The target architecture is:

On-premises environment

ProfileService ---------+
TransactionService -----+----> Local Kafka cluster
CustomerViewService <---+          |
                                   | selected-topic replication
                                   v
Cloud environment                Cloud Kafka cluster
                                      |
                                      +----> Analytics
                                      +----> Machine learning
                                      +----> Business intelligence

The architecture has four separate concerns:

  1. Metadata availability: Controllers must maintain a consistent view of topics, partitions, brokers, access rules, and other cluster metadata.
  2. Data availability: Brokers must continue serving partition data during failures according to the replication policy.
  3. Deployment placement: Kafka must fit network, operational, cost, latency, and compliance constraints.
  4. Security: Every client, broker, controller, and supporting component must be authenticated, authorized, and protected in transit and at rest.

Why the Metadata Plane Must Be Designed Explicitly

Kafka brokers move records between producers and consumers, store partitions, and replicate data. KRaft controllers manage the metadata that tells the cluster how those brokers and partitions are organized.

The controller metadata includes information such as:

  • Topic names and configuration
  • Partition leaders and replica placement
  • In-sync replica state
  • Broker identifiers and connection details
  • Controller roles
  • Access control lists
  • Client quotas
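  • Producer ID allocations for idempotent and transactional producers

Consumer group membership, partition assignments, and committed offsets are not part of the metadata log. Group coordinators on the brokers keep them in the internal __consumer_offsets topic, and transaction state is kept in the internal __transaction_state topic.

Clients do not read the metadata log directly. Brokers receive metadata from the controllers and provide the information clients need.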

A simplified architecture is:

KRaft controller quorum
  - metadata leader
  - metadata followers
          |
          | metadata updates and broker heartbeats
          v
Kafka broker cluster
          |
          +---- producers
          +---- consumers
          +---- Kafka Connect
          +---- Schema Registry

The metadata log is replicated using the Raft consensus model. A change is committed only after a majority of controllers acknowledges it. This protects committed cluster metadata from being lost during leader replacement.
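The majority rule can be made concrete with a small sketch. It is a simplified model, not Kafka's implementation: it assumes each voter reports the last metadata-log offset it has persisted, and it ignores the additional Raft requirement that the entry belong to the leader's current epoch. The function name is hypothetical.

```python
def committed_offset(persisted_offsets: list[int]) -> int:
    """Return the highest offset that a majority of voters has persisted.

    persisted_offsets holds one value per voter, leader included.
    Simplified model: real Raft also checks the epoch of the entry.
    """
    if not persisted_offsets:
        raise ValueError("at least one voter is required")
    majority = len(persisted_offsets) // 2 + 1
    # After sorting in descending order, the value at index majority-1 is
    # held by at least `majority` voters, so it is safe to treat as committed.
    return sorted(persisted_offsets, reverse=True)[majority - 1]


# Leader at 120, one follower at 118, one lagging follower at 97.
print(committed_offset([120, 118, 97]))  # 118
```

The lagging follower does not hold back the commit point, because two of three voters already store everything up to offset 118.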

Build a Majority-Based KRaft Quorum

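A controller quorum should use an odd number of members. An even-sized quorum can still form a majority, but the extra member raises the majority threshold without adding failure tolerance: four controllers tolerate one failure, exactly as three do.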

With three controllers:

Controllers: 3
Majority required: 2
Tolerated controller failures: 1

If one controller fails, the remaining two still form a majority. If two fail, the quorum cannot commit metadata changes.

This does not mean every broker stops immediately when the majority is lost. Existing data traffic may continue temporarily depending on the operation and current cluster state, but metadata changes and reliable control-plane operation require a functioning quorum.
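The arithmetic behind these numbers is short enough to check directly. The following sketch uses a hypothetical helper name and only restates the majority rule described above.

```python
def quorum_math(voters: int) -> tuple[int, int]:
    """Return (majority_required, tolerated_failures) for a voter count."""
    if voters < 1:
        raise ValueError("a quorum needs at least one voter")
    majority = voters // 2 + 1
    return majority, voters - majority


for n in (3, 4, 5):
    majority, tolerated = quorum_math(n)
    print(f"controllers={n} majority={majority} tolerated_failures={tolerated}")
```

Running it shows why a fourth controller adds cost without adding tolerance: three and four controllers both survive a single failure, and only the fifth raises the tolerance to two.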

A three-controller static configuration can be represented like this:

node.id=21
process.roles=controller

controller.quorum.voters=21@meta-a:9193,22@meta-b:9193,23@meta-c:9193

controller.listener.names=CONTROL
listeners=CONTROL://meta-a:9193

listener.security.protocol.map=CONTROL:SASL_SSL

Each controller uses a unique node.id and its own listener address. The other controllers use equivalent configuration with their corresponding identifiers and hosts.

The property meanings are:

Property                          Purpose
node.id                           Uniquely identifies the server in the cluster
process.roles                     Defines whether the node is a controller, broker, or both
controller.quorum.voters          Lists static controller quorum members
controller.listener.names         Identifies the listener used for controller communication
listeners                         Defines where the process accepts connections
listener.security.protocol.map    Maps listener names to security protocols
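A static voter list is easy to get subtly wrong when it is copied between hosts. The sketch below is a hypothetical pre-deployment check, not a Kafka tool: it parses the controller.quorum.voters format shown above and reports mismatches between a controller's own node.id, its listener address, and the voter list. It assumes a single controller listener per node.

```python
def parse_voters(value: str) -> dict[int, tuple[str, int]]:
    """Parse 'id@host:port,...' into {node_id: (host, port)}."""
    voters: dict[int, tuple[str, int]] = {}
    for entry in value.split(","):
        node_id, _, endpoint = entry.strip().partition("@")
        host, _, port = endpoint.rpartition(":")
        if not (node_id.isdigit() and host and port.isdigit()):
            raise ValueError(f"malformed voter entry: {entry!r}")
        if int(node_id) in voters:
            raise ValueError(f"duplicate node.id: {node_id}")
        voters[int(node_id)] = (host, int(port))
    return voters


def controller_problems(node_id: int, listener: str,
                        voters: dict[int, tuple[str, int]]) -> list[str]:
    """Return human-readable problems for one controller's settings."""
    problems = []
    if len(voters) % 2 == 0:
        problems.append("even voter count adds no failure tolerance")
    if node_id not in voters:
        problems.append(f"node.id {node_id} is not in controller.quorum.voters")
        return problems
    # listener looks like 'CONTROL://meta-a:9193'
    address = listener.split("://", 1)[-1]
    host, _, port = address.rpartition(":")
    expected_host, expected_port = voters[node_id]
    if (host, port) != (expected_host, str(expected_port)):
        problems.append(f"listener {address} does not match voter entry "
                        f"{expected_host}:{expected_port}")
    return problems


voters = parse_voters("21@meta-a:9193,22@meta-b:9193,23@meta-c:9193")
print(controller_problems(21, "CONTROL://meta-a:9193", voters))
```

An empty list means the node's identity and address agree with the quorum definition; any other result is worth fixing before the process is started.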

Dynamic controller membership, available in recent Kafka releases, replaces the static voter list with controller.quorum.bootstrap.servers and changes membership through the kafka-storage.sh and kafka-metadata-quorum.sh tools. The important design choice is whether controller membership is intentionally static or operationally managed as a dynamic quorum.

Separate Controllers and Brokers in Production

Kafka supports combined nodes with both roles:

process.roles=broker,controller

Combined mode can be useful for a small development environment, but dedicated roles give production systems clearer resource and failure boundaries.

A dedicated broker configuration may look like this:

node.id=31
process.roles=broker

controller.quorum.voters=21@meta-a:9193,22@meta-b:9193,23@meta-c:9193
controller.listener.names=CONTROL

listeners=INTERNAL://broker-a:9192,CLIENT://broker-a.example:9195
advertised.listeners=INTERNAL://broker-a:9192,CLIENT://broker-a.example:9195

inter.broker.listener.name=INTERNAL

listener.security.protocol.map=CONTROL:SASL_SSL,INTERNAL:SASL_SSL,CLIENT:SASL_SSL

This separation prevents heavy client and partition workloads from competing directly with controller duties on the same process. It also lets the team scale broker storage and throughput independently from the metadata quorum.

The controller quorum should remain small and stable. Broker capacity can grow as traffic and retained data increase.

Understand Controller and Broker Failover Separately

A controller failure and a broker failure affect different parts of Kafka.

Active controller failure

Follower controllers continuously fetch metadata updates from the active controller. When they stop receiving responses within the configured timeout, they start a new election.

A candidate requests votes and presents information about its metadata log. A controller grants a vote only according to the Raft election rules, including whether the candidate's log is sufficiently current.

The new active controller does not immediately process metadata changes. It first ensures that its local metadata contains all committed entries. Brokers then locate the new controller and resume control-plane communication.

Active controller fails
        |
        v
Followers detect missing responses
        |
        v
Election starts
        |
        v
Majority selects a current candidate
        |
        v
New leader verifies committed metadata
        |
        v
Brokers reconnect to the active controller
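The "sufficiently current" condition in this flow follows the usual Raft comparison: a voter supports a candidate only if the candidate's log is at least as up to date as its own, comparing the epoch of the last entry first and the log end offset second. The sketch below is a simplified model with hypothetical names. It leaves out details of real KRaft elections such as per-epoch vote tracking, timeouts, and retries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogPosition:
    last_epoch: int   # epoch of the last entry in the metadata log
    end_offset: int   # offset just past the last entry


def grants_vote(voter: LogPosition, candidate: LogPosition) -> bool:
    """A voter supports a candidate whose log is at least as current as its own."""
    return (candidate.last_epoch, candidate.end_offset) >= (
        voter.last_epoch, voter.end_offset)


def wins_election(candidate: LogPosition, others: list[LogPosition]) -> bool:
    """The candidate votes for itself and needs a majority of all voters."""
    votes = 1 + sum(grants_vote(voter, candidate) for voter in others)
    majority = (len(others) + 1) // 2 + 1
    return votes >= majority


# One follower is level with the candidate, the other is ahead of it.
print(wins_election(LogPosition(5, 100),
                    [LogPosition(5, 100), LogPosition(5, 120)]))  # True
```

In a three-member quorum the candidate wins here with two votes even though one voter is ahead. A candidate whose last entry comes from an older epoch loses to every more current voter, regardless of how long its log is.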

Follower controller failure

The active controller can continue operating while a majority remains. The cluster has less failure tolerance until the follower returns or is replaced.

Broker failure

Brokers send heartbeats to the controller. When a broker is considered unavailable, partition leadership hosted on that broker may be reassigned to eligible replicas on surviving brokers.

The metadata quorum coordinates that change, while the broker replication design determines whether partition data remains available.

This distinction matters during testing. A controller failover test validates metadata leadership. A broker failover test validates partition leadership and data availability. Passing one does not prove the other.
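For the broker side of that distinction, partition leadership can be modeled in a few lines. This is a simplified sketch with hypothetical names: it assumes the controller picks the first replica in assignment order that is both alive and in the in-sync replica set, and it does not model unclean leader election.

```python
from typing import Optional


def elect_leader(replicas: list[int], isr: set[int],
                 live_brokers: set[int]) -> Optional[int]:
    """Return the new partition leader, or None if the partition goes offline."""
    for broker in replicas:
        if broker in isr and broker in live_brokers:
            return broker
    return None


# Broker 31 led the partition and has just failed.
print(elect_leader([31, 32, 33], isr={31, 32}, live_brokers={32, 33}))  # 32

# The only in-sync replica was on the failed broker.
print(elect_leader([31, 32, 33], isr={31}, live_brokers={32, 33}))  # None
```

The second case is the one a controller failover test never exercises: the metadata quorum is healthy, yet the partition is unavailable because replication, not metadata, was the weak point.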

Avoid Carrying ZooKeeper Assumptions into KRaft

Older Kafka deployments use a separate ZooKeeper cluster to store metadata and monitor broker liveness. One Kafka broker also acts as the active controller and coordinates metadata changes with ZooKeeper.

KRaft replaces that split design with Kafka's own controller quorum and metadata log.

Older architecture:
ZooKeeper quorum + Kafka brokers + active controller broker

KRaft architecture:
KRaft controller quorum + Kafka brokers

The operating model is different enough that teams should not copy ZooKeeper-era failover procedures, monitoring assumptions, or configuration directly into a KRaft deployment.

For an existing ZooKeeper-based platform, migration should be treated as an infrastructure project with explicit compatibility, rollback, testing, and operational-readiness work. It should not be combined casually with unrelated broker upgrades or security redesign.

Choose the Deployment Model from Constraints

The organization has three realistic choices:

  • Self-managed Kafka on-premises or in a private environment
  • Managed Kafka in the cloud
  • A hybrid architecture

The correct choice depends on more than where servers are available.

Self-managed on-premises Kafka

This model gives the organization control over:

  • Kafka versions
  • Broker and controller settings
  • Network placement
  • Security integrations
  • Storage layout
  • Monitoring tools
  • Upgrade timing
  • Data location

The cost is operational responsibility. The team must manage provisioning, authentication, authorization, TLS, monitoring, scaling, maintenance, backups, and disaster recovery.

This model fits organizations with strict data-locality requirements or experienced Kafka and infrastructure teams.

Managed cloud Kafka

A managed service can remove much of the hardware and cluster administration. Provisioning, patching, and baseline availability are handled by the provider.

The tradeoffs include:

  • Less access to low-level broker configuration
  • Provider-controlled upgrade schedules
  • Product-specific authentication and monitoring integrations
  • Potentially higher cost at scale
  • Data-transfer charges
  • Migration difficulty when proprietary capabilities are adopted
  • Data-location and compliance questions

A managed service is attractive when the organization values reduced operational load more than detailed control.

Hybrid Kafka

A hybrid system keeps operational data processing close to on-premises producers while providing selected data to cloud services.

The strongest general pattern uses two clusters:

On-premises Kafka
      |
      | replication
      v
Cloud Kafka

A replication tool consumes from the source cluster and produces to the target cluster. Apache Kafka's MirrorMaker 2, Confluent Replicator, or another suitable replication solution can support this pattern.

Separate clusters prevent the cloud analytics environment from becoming part of the local cluster's synchronous broker or controller path. Network interruption then creates replication lag instead of directly stopping local production.

Why One Cross-Environment Cluster Is Risky

A single cloud Kafka cluster can serve both on-premises and cloud clients:

On-premises applications
      |
      | wide-area network
      v
Cloud Kafka cluster
      |
      v
Cloud analytics

This architecture is simpler because it avoids a second cluster and replication layer. It also makes on-premises operations depend on:

  • Stable wide-area connectivity
  • Sufficient network bandwidth
  • Predictable latency
  • Cloud availability
  • Data-transfer pricing
  • Firewall and routing correctness

When the network fails, local producers and consumers can lose access to their event backbone.

The architecture is appropriate when the organization is intentionally centralizing in the cloud and the remaining on-premises dependencies are limited. It is a poor fit when local operational workflows must continue during a cloud or network outage.

Place Replication with Network Behavior in Mind

A replication component behaves as both a consumer and a producer:

Source Kafka
      |
      | consumer fetch across network
      v
Replication process
      |
      | producer write
      v
Target Kafka

Replication tools are often placed near the target cluster so producer writes and acknowledgments remain local and predictable. Source fetches cross the wide-area network and may accumulate lag.

The exact placement depends on:

  • Network topology
  • Security boundaries
  • Firewall rules
  • Throughput
  • Latency
  • Cost
  • Operational ownership

Replication lag must be treated as a first-class metric. Cloud analytics cannot be assumed to contain the same data at the exact moment as the on-premises cluster.
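One way to express that metric is as the per-partition difference between the source log end offset and the last offset the replication process has copied. The sketch below assumes both sets of offsets have already been collected by monitoring. The function names, the threshold, and the sample values are hypothetical, and a partition with no replicated offset is treated as starting from zero.

```python
Partition = tuple[str, int]  # (topic, partition number)


def replication_lag(source_end_offsets: dict[Partition, int],
                    replicated_offsets: dict[Partition, int]) -> dict[Partition, int]:
    """Return the number of records not yet copied, per source partition."""
    lag = {}
    for partition, end_offset in source_end_offsets.items():
        copied = replicated_offsets.get(partition, 0)
        lag[partition] = max(end_offset - copied, 0)
    return lag


def partitions_over_threshold(lag: dict[Partition, int],
                              threshold: int) -> list[Partition]:
    """Return the partitions whose lag exceeds the alert threshold."""
    return sorted(p for p, records in lag.items() if records > threshold)


source = {("CUSTOMER-PROFILE", 0): 1500, ("CUSTOMER-PROFILE", 1): 900}
copied = {("CUSTOMER-PROFILE", 0): 1480}
lag = replication_lag(source, copied)
print(lag)
print(partitions_over_threshold(lag, threshold=100))
```

A record count is only one view of lag. Cloud consumers usually care about how old the newest replicated record is, so a time-based measure is worth tracking alongside it.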

The organization should also transfer only approved topics and fields. Sensitive data can be omitted, anonymized, or transformed before it leaves the controlled environment when business and compliance rules require that separation.
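Field-level filtering of this kind can be sketched as a transformation applied before a record is written to the cloud cluster. The example is illustrative only: the field names, the allow list, and the use of a keyed hash for pseudonymization are assumptions, and a real deployment would apply the same idea inside its replication or stream-processing tool.

```python
import hashlib
import hmac


def prepare_for_cloud(event: dict, allowed: set[str],
                      pseudonymize: set[str], key: bytes) -> dict:
    """Keep only approved fields and replace selected values with a keyed hash."""
    result = {}
    for field, value in event.items():
        if field not in allowed:
            continue  # unapproved fields never leave the local environment
        if field in pseudonymize:
            # A keyed hash gives a stable join key without exposing the value.
            value = hmac.new(key, str(value).encode("utf-8"),
                             hashlib.sha256).hexdigest()
        result[field] = value
    return result


event = {"customer_id": "c-1001", "email": "a@example.com", "segment": "gold"}
cloud_event = prepare_for_cloud(
    event,
    allowed={"customer_id", "segment"},
    pseudonymize={"customer_id"},
    key=b"replace-with-a-managed-secret",  # placeholder, not a real key
)
print(sorted(cloud_event))  # ['customer_id', 'segment']
```

An allow list fails closed: a field added to the source schema later stays on-premises until someone approves it explicitly.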

Secure Every Kafka Communication Path

Kafka security has three main layers:

  1. Encryption: Protects data from being read while it travels over the network.
  2. Authentication: Verifies the identity of a client, broker, or controller.
  3. Authorization: Decides what an authenticated identity may do.

The communication paths include:

  • Producer to broker
  • Consumer to broker
  • Kafka Connect to broker
  • Schema Registry to broker
  • Broker to broker
  • Broker to controller
  • Controller to controller

Securing only the external client listener leaves internal cluster traffic exposed.

A listener plan can assign different names and policies:

CLIENT listener:
  application producers and consumers

INTERNAL listener:
  broker replication

CONTROL listener:
  controller quorum and broker-controller communication

Each listener should have an explicit protocol and a documented network boundary.
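A listener plan like this can be checked mechanically before a node starts. The sketch below is a hypothetical lint step rather than a Kafka feature: it takes the listeners and listener.security.protocol.map values as strings and reports any listener without an explicit mapping and any mapping that does not encrypt traffic.

```python
ENCRYPTED_PROTOCOLS = {"SSL", "SASL_SSL"}


def listener_problems(listeners: str, protocol_map: str) -> list[str]:
    """Return problems found in a node's listener configuration."""
    mapping = dict(item.strip().split(":", 1) for item in protocol_map.split(","))
    problems = []
    for name, protocol in mapping.items():
        if protocol not in ENCRYPTED_PROTOCOLS:
            problems.append(f"{name}: {protocol} does not encrypt traffic")
    for entry in listeners.split(","):
        name = entry.strip().split("://", 1)[0]
        if name not in mapping:
            problems.append(f"{name}: missing from listener.security.protocol.map")
    return problems


print(listener_problems(
    "INTERNAL://broker-a:9192,CLIENT://broker-a.example:9195",
    "CONTROL:SASL_SSL,INTERNAL:SASL_SSL,CLIENT:SASL_SSL",
))
```

Running the same check against every controller and broker file catches the common case where the client listener is secured and an internal one is left on a plaintext protocol.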

Encrypt Traffic with TLS

Transport Layer Security protects network communication and verifies the certificate presented by the remote party.

A Kafka participant usually needs:

  • A keystore containing its private key and certificate
  • A truststore containing trusted certificate authorities
  • Passwords required to access the key material

The certificate workflow is:

Generate key pair
      |
      v
Create certificate signing request
      |
      v
Certificate authority signs request
      |
      v
Import signed certificate
      |
      v
Configure keystore and truststore

A broker-side TLS configuration can use properties such as:

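# The ${file:...} references below are resolved by a config provider,
# which must be declared in the same properties file.
config.providers=file
config.providers.file.class=org.apache.kafka.common.config.provider.FileConfigProvider

ssl.keystore.location=/secure/kafka/broker.keystore.jks
ssl.keystore.password=${file:/secure/kafka/secrets:keystore.password}

ssl.truststore.location=/secure/kafka/broker.truststore.jks
ssl.truststore.password=${file:/secure/kafka/secrets:truststore.password}

ssl.key.password=${file:/secure/kafka/secrets:key.password}
ssl.client.auth=required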

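When ssl.client.auth=required, clients must present trusted certificates. This enables mutual TLS, or mTLS, so both sides authenticate with certificates. On SASL_SSL listeners, such as those in the listener map shown earlier, the unprefixed setting is generally not applied; client certificates are requested there only when the property carries the listener prefix, for example listener.name.client.ssl.client.auth=required. Verify this behavior against the Kafka version in use.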

A client then requires corresponding truststore settings and, for mTLS, its own keystore and certificate.

Do not embed real passwords directly in general configuration files. Supported alternatives include:

  • Environment variables
  • Restricted external files
  • Secret-management vault integrations

Certificate rotation must also be designed before certificates approach expiration.
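A rotation process needs an early-warning step. The sketch below assumes the expiry dates have already been read from the certificates by an inventory job. The participant names, dates, and 30-day window are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def certificates_to_rotate(expiry_by_name: dict[str, datetime],
                           now: datetime,
                           window: timedelta = timedelta(days=30)) -> list[str]:
    """Return names whose certificate has expired or expires within the window."""
    return sorted(name for name, not_after in expiry_by_name.items()
                  if not_after - now <= window)


now = datetime(2026, 6, 20, tzinfo=timezone.utc)
inventory = {
    "broker-a": datetime(2026, 7, 1, tzinfo=timezone.utc),      # 11 days left
    "meta-a": datetime(2027, 1, 15, tzinfo=timezone.utc),       # months left
    "cloud-replicator": datetime(2026, 6, 1, tzinfo=timezone.utc),  # expired
}
print(certificates_to_rotate(inventory, now))  # ['broker-a', 'cloud-replicator']
```

The window should be longer than the time it takes to issue, distribute, and restart with a new certificate, including on the controllers.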

Choose the Authentication Mechanism Deliberately

Kafka supports mTLS and several Simple Authentication and Security Layer, or SASL, mechanisms.

Mechanism                      Typical use
mTLS                           Certificate-based service identity
SASL/PLAIN                     Username and password, protected by TLS
SASL/SCRAM                     Password authentication with salted challenge-response hashing
SASL/GSSAPI                    Kerberos-based enterprise identity
SASL/OAUTHBEARER               Token-based authentication through OAuth 2.0
Provider-specific mechanism    Authentication integrated with a cloud platform

SASL authentication should normally be combined with TLS:

security.protocol=SASL_SSL
sasl.mechanism=SCRAM-SHA-512

Using SASL over an unencrypted connection authenticates the participant but does not protect the traffic from interception.

Organizations centered on Security Assertion Markup Language, or SAML, typically integrate indirectly. An identity provider can authenticate the user through SAML and issue an OAuth token that Kafka accepts through SASL/OAUTHBEARER. A proxy or custom SASL implementation is another possibility, but it adds complexity.

Authentication must be applied consistently to application clients, brokers, controllers, Kafka Connect, Schema Registry, and administrative tools.

Enforce Least Privilege with ACLs

Authentication tells Kafka who is connecting. Access control lists, or ACLs, determine what that principal may do.

An ACL includes:

  • Principal
  • Resource type
  • Resource name
  • Pattern type: literal or prefixed
  • Operation
  • Allow or deny decision
  • Host restriction

A producer identity might receive write and create access to a topic prefix:

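# admin-client.properties is a hypothetical file holding the administrator's
# SASL_SSL client settings; the tool needs them to reach a secured listener.
kafka-acls.sh \
  --bootstrap-server secured-broker:9195 \
  --command-config admin-client.properties \
  --add \
  --allow-principal User:profile-publisher \
  --operation Write \
  --operation Create \
  --resource-pattern-type prefixed \
  --topic CUSTOMER-PROFILE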

A consumer requires more than a topic read permission. It typically needs Read on the topic, which also implies Describe for metadata lookups, and Read on its consumer group, which covers joining the group and committing offsets.

Transactional producers also need Write and Describe on their transactional ID resource, and idempotent producers on older Kafka versions need IdempotentWrite on the cluster. Do not assume that the --producer convenience option of kafka-acls.sh grants, by itself, every permission required for exactly-once workflows.

A practical least-privilege model defines identities by application role:

profile-publisher:
  write CUSTOMER-PROFILE topics

customer-view-reader:
  read CUSTOMER-PROFILE topics
  use CUSTOMER-VIEW consumer groups

cloud-replicator:
  read approved local topics
  write approved cloud topics

platform-operator:
  controlled administrative permissions
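The role model above can be prototyped as data before any ACL is created. The sketch below is a simplified model of ACL evaluation with hypothetical names: a matching deny wins, and a request with no matching allow is rejected. It omits host restrictions, wildcard resources, and the operations that Kafka implies from others, and the topic and group names are examples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Acl:
    principal: str
    resource_type: str  # "topic" or "group"
    name: str
    pattern: str        # "literal" or "prefixed"
    operation: str
    allow: bool = True


def _matches(acl: Acl, principal: str, resource_type: str,
             name: str, operation: str) -> bool:
    if (acl.principal, acl.resource_type, acl.operation) != (
            principal, resource_type, operation):
        return False
    if acl.pattern == "prefixed":
        return name.startswith(acl.name)
    return name == acl.name


def is_authorized(acls: list[Acl], principal: str, resource_type: str,
                  name: str, operation: str) -> bool:
    """Deny overrides allow; no matching allow means the request is rejected."""
    relevant = [a for a in acls
                if _matches(a, principal, resource_type, name, operation)]
    if any(not a.allow for a in relevant):
        return False
    return any(a.allow for a in relevant)


ACLS = [
    Acl("User:profile-publisher", "topic", "CUSTOMER-PROFILE", "prefixed", "Write"),
    Acl("User:customer-view-reader", "topic", "CUSTOMER-PROFILE", "prefixed", "Read"),
    Acl("User:customer-view-reader", "group", "CUSTOMER-VIEW", "prefixed", "Read"),
]

print(is_authorized(ACLS, "User:profile-publisher", "topic",
                    "CUSTOMER-PROFILE-UPDATES", "Write"))  # True
print(is_authorized(ACLS, "User:profile-publisher", "topic",
                    "CUSTOMER-PROFILE-UPDATES", "Read"))   # False
```

Keeping the intended rules in a reviewable form like this makes it easier to write the allowed and denied test cases that the rollout plan calls for.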

Topic-level authorization also affects event design. If sensitive and non-sensitive fields share one topic, every principal allowed to consume that topic can receive the full payload. Kafka ACLs do not provide field-level authorization.

Protect Data at Rest

TLS protects records while they travel over the network. It does not automatically encrypt the broker's stored log segments.

The normal flow is:

Encrypted network traffic
      |
      v
Broker decrypts record
      |
      v
Broker stores record on disk
      |
      v
Broker encrypts outgoing network traffic

Kafka does not provide built-in record encryption at rest. The organization must use another control.

Disk or filesystem encryption

Operating-system or infrastructure-level encryption protects Kafka data files on disk. Linux Unified Key Setup, or LUKS, is one on-premises option. Cloud providers offer managed disk-encryption capabilities.

This approach is transparent to Kafka clients and preserves normal broker processing.

End-to-end payload encryption

The producer encrypts selected fields or the entire payload before sending it. The consumer decrypts it after reading.

Producer encrypts payload
      |
      v
Kafka stores encrypted bytes
      |
      v
Authorized consumer decrypts payload

This can protect data even from broker-disk access, but it complicates:

  • Key distribution and rotation
  • Schema handling
  • Stream processing
  • Filtering and transformation
  • Operational debugging
  • Consumer authorization

If brokers or processing applications need to inspect the payload, full end-to-end encryption may be incompatible with that workflow.

The security design should distinguish network encryption, disk encryption, and payload encryption instead of treating them as one feature.

A Controlled Production Rollout

A production rollout should prove the architecture in stages.

Phase 1: Build the control plane

  1. Deploy the KRaft controller quorum.
  2. Verify unique node identities and listener reachability.
  3. Confirm the quorum has an active leader and synchronized followers.
  4. Test one controller failure.
  5. Test restoration of the failed controller.

Phase 2: Add brokers

  1. Deploy dedicated brokers.
  2. Verify broker-controller communication.
  3. Confirm partition placement and replication.
  4. Stop one broker.
  5. Verify partition leadership moves to eligible replicas.
  6. Confirm producers and consumers recover as designed.

Phase 3: Enable transport security

  1. Create certificates and trust relationships.
  2. Enable TLS on controller, broker, and client paths.
  3. Enable mTLS or SASL authentication.
  4. Test expired, unknown, and revoked credentials.
  5. Verify that plaintext listeners are removed or isolated according to policy.

Phase 4: Apply authorization

  1. Create one principal per application role.
  2. Grant only required topic, group, cluster, and transaction permissions.
  3. Test allowed operations.
  4. Test denied topic and consumer-group access.
  5. Verify administrative permissions are separate from application permissions.

Phase 5: Protect storage

  1. Enable disk or filesystem encryption.
  2. Define key ownership and rotation.
  3. Decide whether selected payload fields need end-to-end encryption.
  4. Verify that backup and recovery processes preserve protection.

Phase 6: Introduce hybrid replication

  1. Create the cloud cluster.
  2. Replicate only approved topics.
  3. Measure normal replication lag.
  4. Interrupt the network path.
  5. Verify local production continues.
  6. Restore connectivity and measure catch-up.
  7. Confirm that cloud consumers handle delayed arrival correctly.

Test Failure and Security Boundaries

Test metadata quorum loss

  1. Stop one controller in a three-controller quorum.
  2. Verify that metadata operations continue.
  3. Stop a second controller in a controlled test environment.
  4. Confirm that the quorum can no longer commit metadata changes.
  5. Restore controllers and verify consistent recovery.

Test active controller replacement

  1. Identify the active controller.
  2. Stop it.
  3. Observe a follower election.
  4. Verify that brokers reconnect to the new leader.
  5. Confirm that committed metadata remains intact.

Test broker failure

  1. Produce data continuously.
  2. Stop a broker hosting partition leaders.
  3. Verify leader reassignment.
  4. Confirm acknowledged records remain readable.
  5. Check client recovery and error rates.

Test unauthorized access

  1. Connect with a valid identity that lacks topic permission.
  2. Attempt a prohibited produce or consume operation.
  3. Confirm that Kafka denies the request.
  4. Verify the denial is visible in security monitoring.

Test certificate failure

  1. Connect with an untrusted certificate.
  2. Connect with an expired certificate.
  3. Confirm that TLS negotiation fails.
  4. Verify alerting and diagnostic information.

Test hybrid network interruption

  1. Stop connectivity between local and cloud environments.
  2. Verify local producers and consumers continue.
  3. Measure replication backlog.
  4. Restore the connection.
  5. Confirm replication catches up without bypassing security controls.

Common Mistakes

Running production controllers and brokers together without evaluating contention

Combined mode simplifies infrastructure but mixes control-plane and data-plane workloads.

Treating three controllers as tolerance for two failures

A three-member quorum requires two members for a majority. It tolerates one controller failure.

Stretching local operational traffic across an unreliable cloud connection

A single remote cluster can turn a network outage into a local application outage.

Choosing a managed service only to avoid setup work

Managed services reduce operations but also affect control, cost, portability, and compliance.

Encrypting only producer and consumer connections

Broker replication carries the same customer records as client traffic, and controller communication carries cluster metadata and credentials. Both need the same protection as the client listener.

Assuming TLS encrypts Kafka log files

TLS protects data in motion. Stored segments require disk, filesystem, cloud, or application-level encryption.

Reusing one broad principal for many applications

Shared identities make least privilege, auditing, revocation, and incident response harder.

Granting only topic read permission to consumers

Consumers also need the appropriate group and metadata-related permissions.

Putting sensitive and public fields in the same topic

Kafka authorization is usually resource-based, not field-based. Topic access exposes the complete record.

Hardcoding keystore and truststore passwords

Use environment substitution, restricted files, or a secret-management integration.

Waiting until production to enable security

Late security changes can alter listeners, identities, deployment automation, testing, and network rules. Enable representative controls early.

Production Checklist

Before approving the enterprise Kafka platform, confirm:

  • Controllers and brokers have intentionally selected roles.
  • Every node has a unique identifier.
  • The controller quorum uses a majority-based design.
  • Controller failover has been tested.
  • Broker failover has been tested separately.
  • ZooKeeper-era procedures are not assumed to apply to KRaft.
  • The deployment model follows latency, cost, compliance, and operational requirements.
  • Hybrid replication is isolated from local synchronous operation.
  • Replication lag is monitored.
  • Every listener has a documented protocol and network boundary.
  • Client, broker, controller, Connect, and registry communication is secured.
  • Authentication mechanisms match enterprise identity requirements.
  • ACLs follow least privilege for topics, groups, cluster operations, and transactions.
  • Sensitive data is separated according to topic-level authorization limits.
  • Passwords and private keys are stored securely.
  • Certificate issuance and rotation are operationalized.
  • Disk or filesystem encryption protects broker storage.
  • End-to-end payload encryption is used only where its processing tradeoffs are acceptable.
  • Failure, unauthorized-access, certificate, and network-interruption tests have passed.

Conclusion

Moving Kafka into an enterprise environment requires designing the control plane, deployment boundary, and security model as one system.

KRaft controllers protect cluster metadata through a majority quorum, while brokers handle partition storage and client traffic. Dedicated roles make those responsibilities easier to scale and troubleshoot. In a hybrid organization, separate local and cloud clusters can protect operational workloads from wide-area network failures while replication supplies cloud analytics asynchronously.

Security must cover every communication path, not only external clients. TLS protects data in motion, authentication establishes identity, ACLs enforce resource access, and disk or payload encryption protects stored records. None of these controls replaces the others.

When metadata failover, network placement, identity, authorization, secrets, and storage protection are tested together, Kafka can move from a permissive proof of concept to a production platform without turning one controller failure, cloud interruption, or stolen credential into an enterprise-wide incident.
