Schema Evolution & Data Migrations
Netflix's Data Gateway and the broader data platform handle schema evolution and data migrations by embracing abstraction, automation, and a schema-first, federated approach. The core principle is to manage these complex processes at the platform level, isolating application developers from the underlying database mechanics and changes.

Schema evolution at the application layer

The Data Gateway provides a stable, versioned API contract to application developers, which decouples them from changes in the underlying database schemas.

Federated GraphQL: For many services, particularly for their API layer, Netflix uses a federated GraphQL architecture. Each microservice publishes its own schema fragment to a central schema registry. The API gateway aggregates these fragments into a single, federated graph for client consumption.

Deprecation workflow: When a schema needs to change, the GraphQL deprecation feature is used. The schema registry tracks the usage of every field. Once usage statistics show that a deprecated field is no longer in use, a backward-incompatible change can be safely performed.

Decoupled APIs: This schema-first approach deliberately decouples the client-facing GraphQL API from the underlying gRPC APIs and database schemas. This allows teams to evolve their services independently without forcing coordinated updates across the entire system.

Schema evolution in the data platform

For the asynchronous data movement pipelines within the Netflix Data Mesh, a more robust and automated system is in place.

Avro and schema registry: The platform uses Apache Avro for a common, compact data format and maintains a schema registry to manage schema versions. This allows the platform to enforce strict schema validation and compatibility checks.

Compatibility checks: The platform validates schema changes for compatibility.
Incompatible changes, such as removing a field that a consumer depends on, are automatically flagged, and the pipeline is paused to notify the owner. This prevents downstream consumers from breaking unexpectedly.

Automated pipeline updates: For compatible changes, the platform is designed to automatically propagate schema changes downstream and update pipelines without manual intervention.

Consumer opt-in/opt-out: Consumers can choose how they handle schema evolution. They can opt in to automatically accept new fields from the upstream source, or opt out and use only a defined subset of fields. This gives control to the consumer while preserving flexibility.

Handling data migrations

Migrations are a necessary reality, whether for replacing a legacy database, updating a schema in place, or moving to a completely new system. Netflix's approach uses custom tooling and careful automation to minimize risk.

Shadowing and dual-writes: For migrating from a legacy database to a new one, Netflix employs a dual-write and shadowing strategy. A "data integrator" service (often part of the Data Gateway) writes to both the old and new databases simultaneously. This allows the team to test the new database with production traffic and to compare the data written to both databases to catch discrepancies.

Phased migration: The migration from Oracle to Cassandra was a multi-year effort that involved replicating data between different services and gradually transitioning functionality. For example, some early cloud migrations involved moving data with custom automation and leveraging services like AWS Database Migration Service.

Canary deployments: The Data Gateway and other services use canary analysis during deployments. This allows Netflix to detect performance regressions and functional failures in a small, isolated environment before rolling out changes to the entire fleet.
For instance, a bug involving multi-partition reads was caught and fixed within the gateway itself, without the application teams even noticing.

Resilience and automation: The overarching strategy is to embrace failure and build resilient, automated tooling. The multi-year, large-scale migrations, such as the monolith-to-microservices transition and the cloud migration, succeeded due to investments in sophisticated operational tools and automation.
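
The compatibility checks described under "Schema evolution in the data platform" can be illustrated with a minimal sketch. It is not Netflix's implementation and it does not reproduce Avro's full schema resolution rules. It assumes schemas are plain dictionaries shaped like Avro record schemas, and the names `CompatibilityReport`, `fields_by_name`, and `check_compatibility` are hypothetical. The check flags a change as incompatible when a field that a consumer depends on is removed or changes type, and treats newly added fields as compatible.

```python
from dataclasses import dataclass, field


@dataclass
class CompatibilityReport:
    """Outcome of comparing two versions of a record schema (hypothetical type)."""
    added: list = field(default_factory=list)     # new upstream fields, treated as compatible
    problems: list = field(default_factory=list)  # reasons a pipeline would be paused

    @property
    def compatible(self) -> bool:
        return not self.problems


def fields_by_name(schema: dict) -> dict:
    """Map field name -> declared type for an Avro-style record schema."""
    return {f["name"]: f["type"] for f in schema["fields"]}


def check_compatibility(old: dict, new: dict, consumer_fields: set) -> CompatibilityReport:
    """Flag changes that would break a consumer reading `consumer_fields`.

    Simplifying assumption: only field removal and type changes are checked;
    defaults, aliases, and type promotion are ignored.
    """
    old_fields, new_fields = fields_by_name(old), fields_by_name(new)
    report = CompatibilityReport()
    for name in sorted(consumer_fields):
        if name not in new_fields:
            report.problems.append(
                f"field '{name}' was removed but a consumer depends on it")
        elif name in old_fields and old_fields[name] != new_fields[name]:
            report.problems.append(
                f"field '{name}' changed type from {old_fields[name]} to {new_fields[name]}")
    report.added = sorted(set(new_fields) - set(old_fields))
    return report
```

In this sketch, a platform would propagate the change automatically when `report.compatible` is true and pause the pipeline and notify the owner otherwise. A consumer that opted in to new fields would pick up the names in `report.added`, while one that opted out would keep reading only its declared subset.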
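
The dual-write and shadowing strategy described under "Handling data migrations" can be sketched in a few lines. This is an illustration under stated assumptions, not the actual "data integrator" service: both stores are modeled as in-memory mappings, the legacy store is assumed to remain the source of truth for reads, and a failed write to the new store is assumed to be recorded rather than surfaced to the caller. The names `DualWriter` and `find_discrepancies` are hypothetical.

```python
from collections.abc import MutableMapping

_MISSING = object()  # sentinel for "key absent in this store"


class DualWriter:
    """Writes every record to both the legacy store and the new (shadow) store."""

    def __init__(self, primary: MutableMapping, shadow: MutableMapping):
        self.primary = primary   # legacy database stand-in, source of truth
        self.shadow = shadow     # new database stand-in, under evaluation
        self.shadow_errors = []  # (key, error) pairs for later investigation

    def write(self, key, value) -> None:
        self.primary[key] = value
        try:
            self.shadow[key] = value
        except Exception as exc:
            # Assumption: a shadow failure must not fail the production request.
            self.shadow_errors.append((key, repr(exc)))

    def read(self, key):
        # Reads stay on the legacy store until the new one is trusted.
        return self.primary[key]


def find_discrepancies(primary: MutableMapping, shadow: MutableMapping) -> dict:
    """Return {key: (primary_value, shadow_value)} for every mismatch."""
    diffs = {}
    for key in primary.keys() | shadow.keys():
        old = primary.get(key, _MISSING)
        new = shadow.get(key, _MISSING)
        if old != new:
            diffs[key] = (old, new)
    return diffs
```

Running `find_discrepancies` periodically over both stores corresponds to the comparison step in the text: an empty result over production traffic is evidence that the new database can take over reads, while any entry points to a write path that needs attention before cutover.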