Renaming a column on a busy table sounds simple until you try it. You edit the schema, generate a migration, deploy, and for thirty seconds in between, half your servers are throwing 500s.
Running multiple replicas doesn't fix this. Rolling deploys don't fix this. Your stateless app might be stateless, but your database isn't. The fix is a pattern called expand and contract, and it's the standard way to ship breaking schema changes without taking the system down.
What is expand and contract?
Expand and contract (also known as parallel change) is a database migration pattern that splits a single breaking schema change into a sequence of small, non-breaking ones. Instead of one risky migration, you ship three:
- Expand: Add the new schema alongside the old. Both shapes coexist.
- Migrate: Move data over and switch reads to the new shape.
- Contract: Remove the old schema once nothing uses it.
At every step, both the previous and the next version of your application code can run against the database without errors. That's the property that makes the deploy safe.
A useful analogy: it's the database equivalent of a lane shift on a highway. You don't close the road, paint new lines, and reopen it. You add a new lane, redirect traffic onto it gradually, then close the old one once it's empty.
Why naive migrations break in production
To understand why this pattern exists, look at what happens during a normal rolling deploy.
When you ship a new version of an app, there's a window where two versions are running side by side:
- Old instances still expect the old schema.
- New instances expect the new schema.
If your migration runs before the new instances are up, old ones crash because the column they're querying no longer exists. If it runs after, new ones crash because the column they expect isn't there yet. There is no moment in the deploy where a single, breaking schema change is safe.
The naive approach makes this worse: edit the model, autogenerate a migration, and the framework produces something that does everything in one step:
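Something like the following, in a Django-style setup. (The User model, the accounts app label, and the field sizes here are illustrative; the same hypothetical setup is reused in all the sketches below.)

```python
# migrations/0002_rename_name.py -- the autogenerated, single-step version.
from django.db import migrations, models


class Migration(migrations.Migration):
    dependencies = [("accounts", "0001_initial")]

    operations = [
        migrations.AddField("user", "first_name",
                            models.CharField(max_length=150, null=True)),
        migrations.AddField("user", "last_name",
                            models.CharField(max_length=150, null=True)),
        # This is the line that takes you down: the moment it runs, every
        # instance still serving the old code fails on queries against name.
        migrations.RemoveField("user", "name"),
    ]
```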
The only way out is to make sure both versions of the code work against the same database state at the same time. That's exactly what expand and contract guarantees.
How expand and contract works
Each phase of the pattern leaves the system in a state where every version of the code that's currently deployed can read and write successfully. The table below shows what's true at each phase, using a column rename as the example:
| Phase | Schema | App writes | App reads |
|---|---|---|---|
| Before | name | name | name |
| Expand | name, first_name, last_name | both | name |
| Migrate | name, first_name, last_name | both | new fields |
| Contract | first_name, last_name | new fields | new fields |
Notice that no two adjacent rows ever conflict: at each phase, code deployed during the previous phase can still read and write successfully against the current schema. That's the invariant. Let's walk through each phase.
Phase 1: Expand
The expand phase adds the new schema without removing the old. The application is taught to write to both shapes but continues reading from the old one, so currently-running instances are unaffected.
First, add the new fields to the model as nullable:
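A sketch, continuing the hypothetical User model from above:

```python
# models.py -- expand phase: the new fields coexist with the old one.
from django.db import models


class User(models.Model):
    name = models.CharField(max_length=300)                    # old shape
    first_name = models.CharField(max_length=150, null=True)   # new shape,
    last_name = models.CharField(max_length=150, null=True)    # nullable for now
```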
Then generate the migration. It will look something like this:
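Roughly, in the same Django-style setup:

```python
# migrations/0002_expand_user_name.py -- additive only. Safe to apply
# while old instances are still serving traffic, because nothing they
# depend on changes.
from django.db import migrations, models


class Migration(migrations.Migration):
    dependencies = [("accounts", "0001_initial")]

    operations = [
        migrations.AddField("user", "first_name",
                            models.CharField(max_length=150, null=True)),
        migrations.AddField("user", "last_name",
                            models.CharField(max_length=150, null=True)),
    ]
```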
Then deploy code that writes to both fields but reads from the old one:
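One way to do the dual-write is in the model's save path. The split-on-first-space rule here is a stand-in for whatever your real mapping is:

```python
# models.py -- expand phase behavior: write both shapes, read the old one.
class User(models.Model):
    name = models.CharField(max_length=300)
    first_name = models.CharField(max_length=150, null=True)
    last_name = models.CharField(max_length=150, null=True)

    def save(self, *args, **kwargs):
        # Dual-write: derive the new columns from the old one on every save.
        if self.name:
            self.first_name, _, self.last_name = self.name.partition(" ")
        super().save(*args, **kwargs)

    @property
    def display_name(self):
        # Reads still come from the old column.
        return self.name
```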
After this deploy, every new row has both representations. Old instances are happy because name still works. New instances are happy because they're populating the new fields.
Phase 2: Migrate
The migrate phase moves existing data into the new shape and switches reads over. The trickiest part is the backfill: existing rows need their new fields populated. Run this in batches, never as a single unbounded UPDATE, because long-running updates hold locks that can stall the entire table.
A typical backfill looks like a small script or data migration that pages through rows in fixed-size chunks:
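Here's a sketch as a Django data migration; the same structure works as a standalone script. The historical model from apps.get_model is the "frozen schema" the next paragraph refers to:

```python
# migrations/0003_backfill_user_name.py -- batched backfill.
from django.db import migrations

BATCH_SIZE = 1000


def backfill_names(apps, schema_editor):
    # Historical model: keeps working even after future changes to User.
    User = apps.get_model("accounts", "User")
    last_id = 0
    while True:
        # Page by primary key so each query is cheap and bounded.
        batch = list(
            User.objects.filter(id__gt=last_id, first_name__isnull=True)
            .order_by("id")[:BATCH_SIZE]
        )
        if not batch:
            break
        for user in batch:
            # Same stand-in mapping as the dual-write.
            user.first_name, _, user.last_name = user.name.partition(" ")
        User.objects.bulk_update(batch, ["first_name", "last_name"])
        last_id = batch[-1].id


class Migration(migrations.Migration):
    # Non-atomic, so each batch commits on its own rather than holding one
    # transaction (and its locks) open across the whole table.
    atomic = False

    dependencies = [("accounts", "0002_expand_user_name")]

    operations = [
        migrations.RunPython(backfill_names, migrations.RunPython.noop),
    ]
```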
A few things matter here regardless of language or ORM:
- Load the data through whatever historical/frozen schema your migration tool exposes, so the backfill keeps working even after future model changes.
- Commit each batch in its own transaction.
- For very large tables (tens of millions of rows), run the backfill out-of-band as a script or job rather than inline in a migration, with a small sleep between batches so replicas can keep up.
Then deploy code that reads from the new fields while still writing to both:
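The write path now derives the legacy column from the new fields instead of the other way around. A sketch:

```python
# models.py -- migrate phase: reads switch to the new fields; the old
# column is still written so the previous release remains a safe rollback.
class User(models.Model):
    name = models.CharField(max_length=300)
    first_name = models.CharField(max_length=150, null=True)
    last_name = models.CharField(max_length=150, null=True)

    def save(self, *args, **kwargs):
        # Dual-write, direction reversed: the new fields are now canonical
        # and the legacy column is derived from them.
        self.name = " ".join(p for p in (self.first_name, self.last_name) if p)
        super().save(*args, **kwargs)

    @property
    def display_name(self):
        # Reads come from the new fields from this deploy onward.
        return " ".join(p for p in (self.first_name, self.last_name) if p)
```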
Continuing to write to name for one more deploy cycle is what makes this phase reversible. If the read switch causes problems, the rollback target still has name populated.
Phase 3: Contract
Once the read switch has soaked long enough to trust (give it at least a day, monitor your error rates), the old schema can be retired. First, stop writing to the old field:
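This deploy mostly just deletes the dual-write override. A sketch, with one caveat worth calling out in passing:

```python
# models.py -- contract phase, step 1: the dual-write save() override is
# deleted. name still exists as a column but is no longer populated.
class User(models.Model):
    # If name carried a NOT NULL constraint, relax it (or add a default)
    # before this deploy, or inserts that omit it will start failing.
    name = models.CharField(max_length=300, null=True)  # legacy, unwritten
    first_name = models.CharField(max_length=150, null=True)
    last_name = models.CharField(max_length=150, null=True)

    @property
    def display_name(self):
        return " ".join(p for p in (self.first_name, self.last_name) if p)
```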
Then remove the field from the model and generate the contract migration:
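Roughly:

```python
# migrations/0004_contract_user_name.py -- the final, destructive step.
# Ship only after the stop-writing deploy is fully rolled out everywhere.
from django.db import migrations


class Migration(migrations.Migration):
    dependencies = [("accounts", "0003_backfill_user_name")]

    operations = [
        migrations.RemoveField("user", "name"),
    ]
```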
The migration is complete. Three deploys, zero downtime, and a clean rollback target available at every intermediate step.
Why this pattern matters
A schema change is a deployment of two coupled systems: the database and the application. The naive approach treats them as one unit, which only works if both can be swapped atomically. In any modern setup with rolling deploys, replicas, or canary releases, that atomicity doesn't exist, so the change has to be designed to be safe at every intermediate state.
This isn't theoretical. The most common cause of "the deploy took us down" incidents on stateful services is a schema change that assumed atomicity it didn't have. Expand and contract is what removes that assumption.
The pattern also pays for itself in less obvious ways. Because every step is independently shippable and reversible, you can pause, run extra observation, or roll back at any phase without scrambling. There is no point in the sequence where you've committed to finishing.
Rules that make it work
A few things to get right when applying the pattern in practice:
1. Every deploy must be backwards-compatible with the previous one. If version N can't handle the schema that version N-1 left behind, you don't have zero downtime. You have a fast outage.
2. New columns must be nullable or have a default. On Postgres, adding a NOT NULL column without a default fails outright on a table with existing rows, and before Postgres 11, adding one with a default rewrote the whole table under an exclusive lock. On Postgres 11+, NOT NULL with a default is fast; on older versions and some other engines it still means a rewrite that locks writers. The safe move is to ship the column nullable, backfill, then tighten the constraint in a later migration (sketched after this list).
3. Backfill in batches. A single UPDATE on a multi-million row table holds locks long enough to take down your app. Batch it, sleep between batches, and monitor replication lag if you have replicas.
4. Don't skip the soak time between phases. The temptation is to ship expand, backfill, and contract back-to-back in one afternoon. Don't. At least one full traffic cycle (usually 24 hours) between the read switch and the contract gives you a real chance to catch problems while rollback is still cheap.
5. Index before you backfill. If your backfill queries filter on a field, add the index in the expand phase. Use CREATE INDEX CONCURRENTLY on Postgres (or the equivalent online-DDL operation on your engine) so it doesn't lock the table; the sketch below shows the mechanics.
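To make rules 2 and 5 concrete, here's a sketch of the Postgres mechanics in the same Django-style setup as above. In practice the index belongs in the expand-phase migration and the NOT NULL tightening in its own later one; they're combined here only to keep the sketch short:

```python
from django.db import migrations, models


class Migration(migrations.Migration):
    # CREATE INDEX CONCURRENTLY refuses to run inside a transaction block.
    atomic = False

    dependencies = [("accounts", "0002_expand_user_name")]

    operations = [
        # Rule 5: build the index without blocking writes (Postgres).
        # accounts_user is Django's default table name for this model.
        migrations.RunSQL(
            "CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_user_last_name "
            "ON accounts_user (last_name);",
            reverse_sql="DROP INDEX IF EXISTS idx_user_last_name;",
        ),
        # Rule 2: tighten to NOT NULL only after the backfill has run and
        # you've verified no NULLs remain.
        migrations.AlterField(
            "user", "last_name",
            models.CharField(max_length=150, null=False),
        ),
    ]
```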
When you can skip it
Expand and contract is overhead. Three deploys instead of one, a week of dual-writing, more code to review. The pattern is overkill when:
- The table is small enough to lock briefly (under 100k rows, low traffic).
- A real maintenance window is available.
- The change is purely additive: a nullable field with no code changes around it.
For everything else (renames, splits, type changes, table restructures on hot tables) this pattern is the only safe way.
Takeaways
- Zero downtime is a property of your migration sequence, not your infrastructure. Replicas and rolling deploys help, but they don't save you from a breaking schema change.
- Three deploys, not one. Expand, migrate, contract. Each is independently safe and reversible.
- Both old and new code must work against every intermediate schema state. This is the core invariant. Everything else follows from it.
- Backfill in batches, always. Unbounded updates on hot tables cause outages.
- Soak between phases. If you ship the whole sequence in an hour, you've just done a risky migration with extra steps.
The pattern is older than most of us. It was written up as parallel change on Martin Fowler's site back in 2014, but it still gets skipped all over the industry. It usually surfaces the same way: an incident review, a timeline on a whiteboard, someone pointing at a slide saying "and this is where things started crashing."