Migrating Kafka Event Schemas Without Breaking Consumers or Trapping Poison Messages

A customer-profile service publishes events that are consumed by shipping, billing, support, fraud, and analytics applications. The first event version contains a customer identifier, name, email address, and delivery address. Months later, the business asks for a loyalty tier, decides that one field should be retired, and wants the address structure reorganized.

The producer team can update its application quickly, but the consumers are owned by different teams and are released on different schedules. Some consumers replay historical events. Others rebuild state from a compacted topic. A consumer may also be offline while the producer starts writing the new format.

Changing a class in the producer is therefore not a local refactoring. It changes a distributed contract. If the change is incompatible, consumers may fail during deserialization, repeatedly read the same invalid record, and stop progressing. Kafka cannot remove that single record or repair its payload inside the broker.

This tutorial designs a controlled schema-evolution workflow for a Kafka customer-profile pipeline. The goal is to let producers and consumers evolve independently while making breaking changes explicit, testable, and recoverable.

The System and Its Risk Boundary

Assume the platform contains these components:

ProfileService owns customer profile data and publishes profile events.
ShippingService needs names and delivery addresses.
BillingService needs invoicing details.
RiskService evaluates customer changes.
SupportProjectionService builds a searchable customer view.
Kafka stores and distributes the events.
Schema Registry stores versioned schemas and enforces compatibility rules.

The data flow is:

ProfileService
      |
      | serialize with registered schema
      v
Kafka topic: customer-profile-events
      |
      +----------------+----------------+----------------+
      |                |                |                |
      v                v                v                v
ShippingService  BillingService   RiskService   SupportProjectionService
      |                |                |                |
      +----------------+----------------+----------------+
                       |
                       v
               Schema Registry

Kafka brokers treat record keys and values as bytes. They do not inspect the business fields or verify that the payload matches the contract. Schema-aware serializers and deserializers perform that work in the clients.

A safe contract process must therefore protect three boundaries:

Design boundary: The event must represent the right business fact at the right granularity.
Compatibility boundary: New and old producer-consumer combinations must be understood before deployment.
Operational boundary: Invalid records and breaking migrations must have a recovery path.

Why a Shared Event Is Not the Producer's Internal Object

The easiest implementation is often to serialize the producer's domain entity directly. That creates a fragile contract because internal refactoring becomes externally visible.

The producer's database model may contain:

Internal flags
Persistence-specific identifiers
Sensitive fields
Relationships irrelevant to consumers
Data that changes for technical rather than business reasons

An event contract should expose information that represents a meaningful business occurrence, not a dump of the producer's internal state.

For a customer profile, the producer may emit an event after a profile is created or updated. The event should answer:

What happened?
Which customer was affected?
When did it happen?
What state or change is being communicated?
Which fields are safe and useful for other services?
Who owns the meaning of each field?

The triggering rule is part of the contract even though Kafka does not represent it directly. Teams need written agreement about when the event is emitted and what the event guarantees.

Choose the Event Shape Before Choosing the Schema Syntax

A schema can describe a bad event perfectly. The first design decision is therefore the event model.

Fact events for complete state

A fact event, also called a state event, carries the full externally relevant state of the entity at that moment.

For an infrequently updated customer profile, a fact event can be a practical choice:

CustomerProfileChanged
  eventId
  occurredAt
  customerId
  fullName
  email
  preferredLanguage
  loyaltyTier
  deliveryAddress

Advantages:

A consumer receives a complete picture.
A new consumer can understand the entity without reconstructing many small changes.
A consumer that missed an earlier event can recover from a later complete event.
The producer does not need separate payloads for every consumer.

Risks:

Records are larger.
Unneeded data may be exposed to every authorized consumer.
Sensitive fields require careful access-control decisions.
Repeating unchanged fields increases network and storage use.

Delta events for changed fields

A delta event carries only the change:

CustomerEmailChanged
  customerId
  previousEmail
  newEmail
  occurredAt

Advantages:

Smaller payloads
Clear description of a specific change
Reduced unnecessary data exposure

Risks:

Consumers may need earlier events or another state source.
Missed or out-of-order changes can produce an incorrect projection.
Rebuilding current state becomes more complicated.
The number of event types may grow quickly.

For the profile scenario, full-state fact events are reasonable when updates are uncommon and several consumers need different parts of the profile. Delta events become more attractive when entities are large, updates are frequent, or security requires narrower exposure.

Avoid notification-only events unless the synchronous dependency is intentional

A notification-only event says that something changed but does not carry the changed state.

CustomerProfileChanged
  customerId
  occurredAt

The consumer then calls ProfileService to retrieve the current profile.

This keeps the event small, but it reintroduces synchronous coupling:

Kafka notification
      |
      v
Consumer receives customerId
      |
      v
Consumer calls ProfileService
      |
      v
ProfileService returns current state

The consumer may observe a newer state than the one that triggered the notification. It also depends on the producer being available at processing time.

Use this pattern only when those tradeoffs are acceptable.

Define the Event Boundary: Composite, Atomic, or Aggregate

A customer profile contains related structures such as contact details, addresses, and preferences. The team must decide whether these changes belong in one event or several.

Composite event

A composite event includes related parts of one domain update.

CustomerProfileChanged
  customer details
  contact details
  delivery address
  communication preferences

This is useful when the parts are changed together and consumers need a coherent view. It reduces the risk that one consumer receives an address event before the corresponding profile event.

Atomic events

Atomic events describe small, individual changes:

CustomerCreated
DeliveryAddressChanged
PreferredLanguageChanged
MarketingConsentChanged

These events are precise, but consumers that need the full profile must coordinate multiple streams. If related events use different topics or partitions, their processing order may not match the business dependency.

Aggregate event

An aggregate event is produced after several lower-level events form a higher-level outcome.

CustomerRegistrationStarted
EmailVerified
DeliveryAddressValidated
      |
      v
CustomerRegistrationCompleted

An intermediate processing service can consume the lower-level events and publish the completed outcome. This makes the final event reusable by several consumers instead of forcing each consumer to implement the same aggregation logic.

For the profile contract, use a composite fact event when profile details should be treated as one externally consistent view. Use atomic events when individual changes have independent business meaning. Use an aggregate event when several events together represent a new outcome.

Select a Schema Format Deliberately

Kafka data contracts are commonly described with Avro, Protocol Buffers, or JSON Schema. All three can define structured records, but they make different tradeoffs.

Format	Encoded data	Strengths	Tradeoffs
Avro	Binary	Compact data, strong schema evolution, broad ecosystem	Payload is not human-readable
Protobuf	Binary	Compact, fast, strong compatibility model, generated classes	Field tag discipline is essential
JSON Schema	Text	Human-readable payloads, familiar debugging experience	Larger records and a more flexible content model

A team should choose based on:

Compatibility requirements
Record size
Performance needs
Language support
Human readability
Code generation preferences
Existing organizational tooling

For the customer-profile pipeline, assume the team selects Protobuf because it wants compact records, generated types, and explicit field tags.

Version 1

syntax = "proto3";

package customer.profile;

message CustomerProfileChanged {
  string event_id = 1;
  string occurred_at = 2;
  string customer_id = 3;
  string full_name = 4;
  string email = 5;
  string delivery_address = 6;
}

The field numbers are part of the binary contract. Once assigned, they must not be reused for a different meaning.

A compatible Version 2

The team adds an optional loyalty tier and stops using the old address field. It introduces a structured address with new field numbers.

syntax = "proto3";

package customer.profile;

message CustomerProfileChanged {
  string event_id = 1;
  string occurred_at = 2;
  string customer_id = 3;
  string full_name = 4;
  string email = 5;

  reserved 6;
  reserved "delivery_address";

  string loyalty_tier = 7;

  message Address {
    string street = 1;
    string city = 2;
    string postal_code = 3;
    string country_code = 4;
  }

  Address shipping_address = 8;
}

Reserving the removed tag and name prevents accidental reuse. Reusing field number 6 for unrelated data could cause old consumers to interpret the bytes incorrectly.

Treat the Producer as the Contract Owner

The producer team owns the domain and should own the event schema. Ownership does not mean making changes without consultation.

A practical responsibility model is:

Responsibility	Primary owner
Business meaning of the event	Producer domain team
Schema repository	Producer domain team
Compatibility policy	Producer team with platform governance
Consumer impact review	Producer and consumer teams
Registry availability and access	Platform team
Migration execution	Producer, consumer, and platform teams
Invalid-record monitoring	Owning application teams

The schema should live in a version-controlled repository that consumers can access. That repository becomes the reviewable source for:

Current schemas
Historical versions
Compatibility policy
Topic mapping
Ownership
Change rationale
Generated-code workflow
Release status

A wiki page alone is not enough because it can drift away from the deployed serializers.

Understand What Schema Registry Actually Does

Schema Registry is a separate service in the Kafka ecosystem. It stores schemas, versions them, and rejects registrations that violate the configured compatibility mode.

A typical write path is:

1. Producer serializer looks up the schema.
2. Schema Registry returns or registers a schema ID.
3. Producer writes the schema ID with the encoded payload.
4. Consumer reads the schema ID.
5. Consumer retrieves the matching writer schema.
6. Producer and consumer cache schemas locally.

The registry stores its own data in a compacted Kafka topic, commonly named _schemas. A registry deployment can run as a cluster so that reads and writes remain available during server failure.

The registry does not inspect every record inside the Kafka broker. If a producer bypasses the schema-aware serializer or writes corrupted bytes, Kafka can still accept the record.

That distinction is critical:

Schema Registry validates schema registration and supports client serialization. It is not broker-side payload validation.

Choose the Subject Naming Strategy Before Combining Event Types

A subject is the versioned schema identity on which compatibility rules are applied. The subject strategy determines how schemas relate to Kafka topics.

TopicNameStrategy

The subject is tied to the topic. This is the default strategy.

Use it when one topic contains one compatible schema family:

Topic:
customer-profile-events

Subject:
customer-profile-events-value

This is simple and gives consumers confidence that values in the topic belong to one evolving contract.

RecordNameStrategy

The subject is based on the fully qualified record name.

Use it when the same event type may appear in several topics and should evolve as one global schema identity.

Subject:
customer.profile.CustomerProfileChanged

The tradeoff is that a schema change affects every topic using that record identity.

TopicRecordNameStrategy

The subject combines the topic and record name.

Use it when one topic contains multiple event types and each type must evolve independently within that topic.

Subject:
customer-events-customer.profile.CustomerProfileChanged

This is useful for a polyglot topic, but consumers must inspect schema identity and handle several event types.

For the profile pipeline, TopicNameStrategy is the simplest choice because the topic is dedicated to one event family.

Select Compatibility from the Deployment Order

Compatibility is not an abstract label. It defines which application version must be deployed first.

Backward compatibility

A new consumer can read records written with the previous schema.

The deployment order is:

1. Register compatible schema.
2. Upgrade consumers.
3. Upgrade producer.

Typical compatible changes include:

Removing a field that the new consumer can ignore in old records
Adding a field with a default or optional representation

Backward compatibility is practical when consumer teams can be upgraded before the producer begins writing the new form.

Forward compatibility

An old consumer can read records written with the new schema.

The deployment order is:

1. Register compatible schema.
2. Upgrade producer.
3. Upgrade consumers later.

Typical compatible changes include:

Adding fields that old consumers ignore
Removing optional fields for which old consumers have defaults

Forward compatibility is useful when the producer must move first.

Full compatibility

Both new consumers reading old data and old consumers reading new data are supported.

This gives more deployment freedom, but permits a narrower set of changes. Adding and deleting optional fields is the common safe direction.

Transitive compatibility

A non-transitive check compares the new schema only with the preceding version. A transitive mode compares the new schema with every earlier version.

Transitive compatibility matters when:

Old records remain in Kafka for a long time.
Consumers replay historical data.
A data lake preserves records indefinitely.
Some consumers remain several versions behind.

For a compacted profile topic that may be rebuilt from records written by many historical producer versions, a transitive mode is safer than checking only against the latest schema.

Do Not Let the First Producer Register Production Contracts Accidentally

Some serializers can automatically register a schema when the producer sends its first record. This is convenient during experimentation, but risky in a controlled environment.

A developer can accidentally make the first production send define the contract. If the schema is wrong, the team may need another version immediately and may already have invalid or unwanted records in Kafka.

A controlled process registers the reviewed schema before the application release.

Confluent client properties can disable automatic registration and use the registry's existing version:

auto.register.schemas=false
use.latest.version=true

These settings are specific to Confluent tooling rather than Apache Kafka itself.

A safer release pipeline is:

Schema change proposed
        |
        v
Peer and consumer review
        |
        v
Compatibility check
        |
        v
Schema registered in target environment
        |
        v
Generated artifacts built
        |
        v
Consumers or producer deployed in required order
        |
        v
Production events enabled

Write access to the _schemas topic and schema-management API should be restricted. The schema store should also be included in backup and recovery planning because damaged or missing schemas can prevent clients from decoding existing records.

Test Every Producer-Consumer Version Combination

A passing schema registration is necessary, but it is not a complete application test.

For a backward-compatible release, test at least:

Writer	Reader	Expected result
Producer V1	Consumer V1	Existing behavior remains valid
Producer V1	Consumer V2	New consumer reads historical records
Producer V2	Consumer V2	New behavior works
Producer V2	Consumer V1	Allowed only if forward behavior is also required

Tests should verify business meaning, not only deserialization.

For example:

What value does Consumer V2 assign when loyalty_tier is absent?
Does an old consumer safely ignore shipping_address?
Can a projection rebuild from records written with several schema versions?
Are removed fields still required by a hidden consumer?
Does the consumer handle multiple schema IDs in one partition?
Can invalid data be isolated without blocking the partition?

A compatibility rule can confirm that bytes are readable while a semantic change still breaks the business. Renaming loyalty_tier to risk_level with the same type would be technically representable but logically wrong.

Prevent Poison Messages from Blocking a Partition

A poison message is a record that a consumer cannot deserialize or process successfully. Because Kafka does not delete one arbitrary record from the log, an unhandled poison message can create a loop:

Consumer reads offset 842
        |
        v
Deserialization fails
        |
        v
Offset is not committed
        |
        v
Consumer restarts or retries
        |
        v
Offset 842 is read again

The system needs an explicit invalid-record policy.

A practical workflow is:

Capture topic, partition, offset, key, schema ID, and error details.
Prevent the failing record from crashing the complete consumer process.
Redirect the record or its diagnostic representation to a dead-letter topic.
Commit or advance only according to the approved failure policy.
Alert on the dead-letter topic.
Investigate the producer and contract violation.
Replay or repair the business operation after correction.

A dead-letter topic without monitoring is only a hidden failure queue. Ownership and alert thresholds must be defined before production.

In an emergency, operators can identify the failing offset and reset the consumer group position with Kafka's consumer-group tooling. That is a recovery action, not a substitute for application-level error handling.

Handle a Breaking Change as a Migration

Some changes cannot satisfy the chosen compatibility mode:

Changing the meaning of a field
Changing a field to an incompatible type
Reorganizing the event around a different domain boundary
Replacing one event family with another
Removing data required by old consumers
Reusing identifiers or Protobuf tags incorrectly

Do not temporarily set compatibility to NONE and deploy casually. Treat the change as a migration.

Option 1: Drain and replace a non-compacted topic

For a topic with expiring events:

1. Stop the old producer.
2. Let consumers process the remaining old records.
3. Register the new contract through a controlled change.
4. Upgrade producer and consumers.
5. Resume traffic with the new format.

This can require downtime and careful coordination.

Option 2: Create a versioned topic

A safer approach creates a separate topic:

customer-profile-events-v1
customer-profile-events-v2

A phased migration can use dual writing or dual reading:

Phase 1:
Producer V1 -> Topic V1 -> Consumer V1

Phase 2:
Producer migration path -> Topic V1 and Topic V2
Consumer V1 reads Topic V1
Consumer V2 validates Topic V2

Phase 3:
Producer V2 -> Topic V2 -> Consumer V2

Phase 4:
Topic V1 and old applications are retired

This costs more operational effort but avoids forcing every application to switch at one instant.

Option 3: Migrate a compacted state topic

A compacted topic may contain the latest state for every key. A breaking change requires transforming the retained state, not merely waiting for old events to expire.

Use a dedicated migration service:

customer-profiles-v1
        |
        | consume every retained key and value
        v
ProfileMigrationService
        |
        | transform while preserving keys and timestamps
        v
customer-profiles-v2

The rollout becomes:

Create the V2 topic.
Migrate retained V1 records.
Validate key counts and transformed state.
Start the V2 producer or dual-write period.
Move consumers to V2.
Stop V1 traffic.
Retire V1 after verification and rollback windows close.

Preserving keys matters because compaction and partitioning depend on them. Preserving relevant timestamps helps maintain the intended event history and retention behaviour.

Build Governance into the Delivery Process

Schema governance should make the correct workflow easier than bypassing it.

A practical repository can contain:

customer-profile-contract/
  schemas/
    v1/
    v2/
  compatibility-policy.txt
  ownership.txt
  topic-mapping.txt
  change-log.txt
  consumer-impact.txt
  migration-plan.txt

The delivery process should require:

Producer-team ownership
Consumer impact review
Automated compatibility checks
Generated-code validation where applicable
Registration before deployment
Environment-specific promotion
Restricted registry write access
Dead-letter monitoring
A rollback or migration plan for significant changes
Documentation of business semantics, not only field types

Business analysts can define the required meaning and use cases. Developers or data engineers usually translate those requirements into technical schemas. Both groups are needed because a technically valid schema can still model the wrong business concept.

Common Mistakes

Publishing the internal database entity

This exposes implementation details and makes internal changes break external consumers.

Choosing delta events only to reduce payload size

Smaller records can create complex consumer reconstruction, ordering, and recovery logic.

Splitting related data into atomic events without an ordering plan

Consumers may receive dependent events in an unexpected order, especially across topics or partitions.

Treating Schema Registry as broker validation

A producer that bypasses the expected serializer can still write malformed bytes to Kafka.

Adding a required field without a rollout plan

Old records do not contain it, and old consumers may not understand it.

Checking only the latest schema version

A new schema can be compatible with V9 but incompatible with historical V1 data still retained in Kafka.

Reusing a Protobuf tag

An old consumer may interpret the new field as the old meaning.

Enabling automatic production registration

The first application instance can publish an unreviewed contract.

Hiding invalid records in an unmonitored dead-letter topic

Processing continues, but business data silently remains unresolved.

Making a breaking change in the original compacted topic

Historical retained state still uses the old format and must be transformed deliberately.

Release Checklist

Before changing a Kafka data contract, confirm:

The event represents a business fact rather than an internal entity.
The team chose fact, delta, notification, composite, atomic, or aggregate structure deliberately.
The producer domain team owns the contract.
Sensitive and unnecessary fields have been removed.
The schema format matches compatibility, performance, and tooling needs.
The subject naming strategy matches the topic design.
The compatibility mode matches the intended deployment order.
Transitive checks are enabled when historical versions remain relevant.
The schema is reviewed and registered before application deployment.
Automatic registration is disabled where controlled releases are required.
New and old producer-consumer combinations have been tested.
Business semantics are tested in addition to deserialization.
Poison-message handling and dead-letter monitoring are active.
Breaking changes use a topic or state-migration plan.
Keys and timestamps are preserved during compacted-state migration.
Registry access, backup, and recovery are defined.

Conclusion

Kafka data contracts fail in production when teams treat them as serialization details. A contract controls how independently deployed services understand the same business event across time.

Safe evolution begins with the event boundary: decide whether consumers need complete state, a narrow change, a composite view, or an aggregated outcome. Then choose a schema format, subject strategy, compatibility mode, and release order that match the system's operational reality.

Schema Registry can reject incompatible registrations and help clients decode historical records, but it cannot protect the pipeline from every malformed producer or every semantic mistake. Controlled registration, version-combination tests, poison-message handling, and explicit breaking-change migrations complete the design.

When these practices are built into delivery, adding a field becomes a reviewed system change instead of a production experiment.