Zero-Downtime Database Migrations at Scale
A battle-tested playbook for executing database migrations on high-traffic production systems without downtime, featuring real-world patterns and automation.
Database migrations are where DevOps theory meets harsh reality. You can have perfect CI/CD pipelines, immutable infrastructure, and blue-green deployments for your application layer—but when it’s time to migrate a production database serving millions of requests per second, all that elegance disappears. One wrong move and you’re looking at hours of downtime, data loss, or worse: silent data corruption that surfaces weeks later.
I’ve led database migrations ranging from simple schema changes on modest PostgreSQL instances to multi-petabyte data platform migrations across cloud providers. Each one taught me something new about what works, what fails spectacularly, and what “zero-downtime” actually means in production.
The Real Cost of Downtime
Before diving into patterns, let’s talk about why this matters. In my experience working with e-commerce, fintech, and SaaS platforms, I’ve seen the real costs:
- Revenue loss: For a mid-sized e-commerce platform doing $50M/year, every hour of downtime costs roughly $5,700 in lost sales ($50M spread over the 8,760 hours in a year)
- Customer trust: Users who encounter errors during checkout have a 60%+ abandonment rate and may never return
- SLA penalties: Enterprise contracts often include financial penalties for downtime exceeding agreed thresholds
- Team burnout: Emergency migrations that go sideways mean 3 AM war rooms and exhausted engineers
The stakes are high enough that “we’ll just do it during a maintenance window” isn’t acceptable for most modern systems.
The Expand-Contract Pattern: Your Foundation
The expand-contract pattern is the cornerstone of zero-downtime database migrations. I’ve used this pattern successfully on databases ranging from 100GB to 50TB+, and it works because it separates concerns into distinct, reversible phases.
Phase 1: Expand (Additive Changes)
First, you make purely additive changes to the database schema. No deletions, no modifications—only additions.
-- Example: Renaming a column from 'user_name' to 'username'
-- Phase 1: Add the new column
ALTER TABLE users ADD COLUMN username VARCHAR(255);
-- Create a trigger to keep both columns in sync
CREATE OR REPLACE FUNCTION sync_username()
RETURNS TRIGGER AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        -- On insert, fill in whichever column the writer left out
        NEW.username := COALESCE(NEW.username, NEW.user_name);
        NEW.user_name := COALESCE(NEW.user_name, NEW.username);
    ELSE
        -- On update, propagate the column that actually changed; checking
        -- only for NULLs would let already-backfilled rows drift apart
        IF NEW.user_name IS DISTINCT FROM OLD.user_name THEN
            NEW.username := NEW.user_name;
        ELSIF NEW.username IS DISTINCT FROM OLD.username THEN
            NEW.user_name := NEW.username;
        END IF;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER sync_username_trigger
BEFORE INSERT OR UPDATE ON users
FOR EACH ROW EXECUTE FUNCTION sync_username();

-- Backfill existing data in batches to avoid long-held locks
-- (the batched UPDATE pattern under "Changing Column Types Safely"
-- below works here too); it runs async and doesn't block the migration
At this point, your application still uses the old column (user_name), but the new column (username) exists and stays synchronized via triggers.
Phase 2: Migrate Application Code
Deploy application changes that write to both columns but read from the new one:
# Old code
user = User(user_name="john_doe")

# Transitional code (deploy this first)
user = User(
    user_name="john_doe",  # Still writing to old column
    username="john_doe"    # Also writing to new column
)

# Application reads from new column
print(user.username)  # Not user.user_name
This phase is critical: your application is now using the new schema, but the old schema still works. You can roll back at any time by reverting the application deployment.
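One refinement that has saved me pain: funnel the dual write through a single helper so no call site can forget a column. A minimal sketch in plain Python (your ORM's hooks or events can do the same job; the set_username helper is illustrative):

class User:
    def __init__(self, username=None, user_name=None):
        # Accept either spelling during the transition
        self.set_username(username if username is not None else user_name)

    def set_username(self, value):
        # Single choke point: both columns always move together
        self.username = value
        self.user_name = value  # legacy column, removed in the contract phase

With this in place, the transitional constructor call above keeps working, and new code can stop thinking about the old column entirely.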
Phase 3: Contract (Remove Old Schema)
Only after the new code has been stable in production for days or weeks do you remove the old column:
-- Drop the trigger first
DROP TRIGGER IF EXISTS sync_username_trigger ON users;
DROP FUNCTION IF EXISTS sync_username();
-- Remove the old column (takes a brief ACCESS EXCLUSIVE lock,
-- but doesn't rewrite the table)
ALTER TABLE users DROP COLUMN user_name;
This three-phase approach gives you multiple rollback points and ensures you never have a moment where the database schema and application code are incompatible.
Handling Large-Scale Data Migrations
The expand-contract pattern works great for schema changes, but what about migrating 10TB of data from one database to another? Here’s the playbook I’ve used for cloud-to-cloud migrations.
The Dual-Write Pattern
For massive data migrations (like moving from AWS RDS to Google Cloud SQL), I use a dual-write strategy with eventual consistency:
┌─────────────────────────────────────────────────────┐
│ Application Layer │
│ │
│ ┌──────────────────────────────────────────────┐ │
│ │ Write Logic (Synchronous) │ │
│ │ 1. Write to Primary DB (Source) │ │
│ │ 2. Write to Secondary DB (Target) │ │
│ │ 3. Queue event for async verification │ │
│ └──────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────┐ │
│ │ Read Logic │ │
│ │ - Phase 1: Read from Source │ │
│ │ - Phase 2: Read from Target (after cutover) │ │
│ └──────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────┘
           │                           │
           ▼                           ▼
  ┌──────────────────┐        ┌──────────────────┐
  │    Source DB     │        │    Target DB     │
  │    (AWS RDS)     │        │ (GCP Cloud SQL)  │
  │                  │ DMS /  │                  │
  │ Historical Data  │Debezium│  Async Backfill  │
  │                  │───────►│  + Verification  │
  └──────────────────┘        └──────────────────┘
Implementation Strategy
Step 1: Initial Bulk Copy (Days to Weeks)
Use AWS Database Migration Service (DMS) or Debezium for initial replication:
# DMS Task Configuration (simplified)
migration_task:
  source:
    engine: postgres
    endpoint: source-db.us-east-1.rds.amazonaws.com
  target:
    engine: postgres
    endpoint: target-db.us-central1.sql.gcp.com
  replication_config:
    full_load: true
    cdc_enabled: true  # Capture ongoing changes
  table_mappings:
    - schema: public
      tables: ["users", "orders", "products"]
  performance:
    parallel_threads: 8
    batch_size: 10000
    max_full_load_tasks: 4
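The YAML above is simplified pseudo-config; in practice the task is created through the DMS API or console. If you script it, here's a sketch with boto3 (the ARNs and identifiers are placeholders, and the endpoints and replication instance must already exist):

import json
import boto3

dms = boto3.client("dms")

response = dms.create_replication_task(
    ReplicationTaskIdentifier="rds-to-cloudsql-core-tables",
    SourceEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:SOURCE",
    TargetEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep:INSTANCE",
    MigrationType="full-load-and-cdc",  # bulk copy first, then stream changes
    TableMappings=json.dumps({
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-core-tables",
            "object-locator": {"schema-name": "public", "table-name": "%"},
            "rule-action": "include",
        }]
    }),
)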
Step 2: Enable Dual Writes
Modify your application to write to both databases:
# Assumes asyncio DB clients and a queue client with an async publish()
from datetime import datetime, timezone

class DualWriteUserRepository:
    def __init__(self, primary_db, secondary_db, retry_queue, metrics):
        self.primary = primary_db
        self.secondary = secondary_db
        self.retry_queue = retry_queue
        self.metrics = metrics

    async def create_user(self, user_data):
        # Write to primary (source) - this must succeed
        try:
            primary_result = await self.primary.create(user_data)
        except Exception:
            self.metrics.increment('dual_write.primary.error')
            raise

        # Write to secondary (target) - failures are logged but don't block
        try:
            await self.secondary.create(user_data)
            self.metrics.increment('dual_write.secondary.success')
        except Exception:
            self.metrics.increment('dual_write.secondary.error')
            # Queue for retry, but don't fail the request
            await self.queue_retry(user_data)

        return primary_result

    async def queue_retry(self, data):
        """Queue failed secondary writes for async retry"""
        await self.retry_queue.publish({
            'operation': 'create_user',
            'data': data,
            'timestamp': datetime.now(timezone.utc).isoformat()
        })
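The retry queue needs a consumer, or failed secondary writes pile up silently. A minimal worker sketch (the subscribe() interface mirrors the hypothetical queue client above; the dead-letter handling is illustrative):

class DualWriteRetryWorker:
    def __init__(self, retry_queue, secondary_db, metrics, max_attempts=5):
        self.queue = retry_queue
        self.secondary = secondary_db
        self.metrics = metrics
        self.max_attempts = max_attempts

    async def run(self):
        # Replay failed secondary writes until they stick or exhaust retries
        async for message in self.queue.subscribe():
            attempts = message.get('attempts', 0)
            try:
                if message['operation'] == 'create_user':
                    await self.secondary.create(message['data'])
                self.metrics.increment('dual_write.retry.success')
            except Exception:
                if attempts + 1 >= self.max_attempts:
                    # Give up loudly; the verifier below will also catch the gap
                    self.metrics.increment('dual_write.retry.dead_letter')
                else:
                    message['attempts'] = attempts + 1
                    await self.queue.publish(message)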
Step 3: Continuous Verification
This is where most teams fail. You need automated verification that both databases stay in sync:
# Verification worker (runs continuously)
import hashlib
import json

class DataConsistencyVerifier:
    def __init__(self, source_db, target_db, metrics, alert):
        self.source_db = source_db
        self.target_db = target_db
        self.metrics = metrics
        self.alert = alert

    async def verify_batch(self, table, last_id, limit):
        """Compare a batch of records between source and target"""
        # Keyset pagination (id > last_id) instead of OFFSET, so batches
        # stay stable and fast while rows are still being written
        query = f"SELECT * FROM {table} WHERE id > {last_id} ORDER BY id LIMIT {limit}"
        source_records = await self.source_db.query(query)
        target_records = await self.target_db.query(query)

        # Compare checksums for efficiency
        source_checksum = self._compute_checksum(source_records)
        target_checksum = self._compute_checksum(target_records)

        if source_checksum != target_checksum:
            # Detailed comparison to find discrepancies
            for diff in self._find_differences(source_records, target_records):
                await self.handle_inconsistency(diff)
            self.metrics.increment(f'verification.{table}.inconsistent')
        else:
            self.metrics.increment(f'verification.{table}.consistent')

        # Return the last id seen so callers can page forward
        return source_records[-1]['id'] if source_records else None

    def _compute_checksum(self, records):
        """Order-stable hash of a batch of row dicts"""
        payload = json.dumps(records, sort_keys=True, default=str)
        return hashlib.sha256(payload.encode()).hexdigest()

    def _find_differences(self, source_records, target_records):
        """Pair rows by primary key and classify each discrepancy"""
        target_by_id = {r['id']: r for r in target_records}
        for record in source_records:
            match = target_by_id.get(record['id'])
            if match is None:
                yield {'type': 'missing_in_target', 'id': record['id'], 'record': record}
            elif match != record:
                yield {'type': 'data_mismatch', 'id': record['id']}

    async def handle_inconsistency(self, diff):
        """Fix inconsistencies automatically or alert"""
        if diff['type'] == 'missing_in_target':
            # Auto-repair: copy from source to target
            await self.target_db.upsert(diff['record'])
        elif diff['type'] == 'data_mismatch':
            # Alert for manual review
            await self.alert(f"Data mismatch for ID {diff['id']}")
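To sweep whole tables, I loop the verifier using the id it returns as a cursor. A minimal driver sketch (the table list and pacing are illustrative):

import asyncio

async def run_verification(verifier, tables, batch_size=1000, idle_seconds=60):
    """Continuously sweep every table, batch by batch"""
    while True:
        for table in tables:
            last_id = 0
            while True:
                next_id = await verifier.verify_batch(table, last_id, batch_size)
                if next_id is None:  # reached the end of the table
                    break
                last_id = next_id
        await asyncio.sleep(idle_seconds)  # pause between full sweeps

# asyncio.run(run_verification(verifier, ["users", "orders", "products"]))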
Step 4: The Cutover
After weeks of dual writes with verified consistency:
# Feature flag based cutover
class UserRepository:
    def __init__(self, source_db, target_db, feature_flags):
        self.source = source_db
        self.target = target_db
        self.flags = feature_flags

    async def get_user(self, user_id):
        # Gradual rollout: 1% -> 10% -> 50% -> 100%
        if self.flags.is_enabled('read_from_target', user_id):
            return await self.target.get(user_id)
        else:
            return await self.source.get(user_id)
This allows you to test the target database with production traffic while maintaining the ability to instantly roll back by toggling a feature flag.
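If your flag service doesn't support percentage rollouts, the sticky bucketing is easy to build. A minimal sketch (this FeatureFlags class is hypothetical, not a specific library's API):

import hashlib

class FeatureFlags:
    def __init__(self, rollout_percentages):
        # e.g. {'read_from_target': 10} sends 10% of users to the target DB
        self.rollout = rollout_percentages

    def is_enabled(self, flag, user_id):
        percent = self.rollout.get(flag, 0)
        # Hash flag and user together: each flag gets an independent split,
        # but a given user always lands in the same bucket (0-99)
        digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
        return int(digest, 16) % 100 < percent

flags = FeatureFlags({'read_from_target': 10})  # dial up: 1 -> 10 -> 50 -> 100

The stable hash matters: a given user always lands on the same side of the split, so they don't flap between databases as you raise the percentage.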
Schema Migrations with Minimal Locking
One of the biggest challenges with PostgreSQL (and MySQL) is that many DDL operations acquire locks that block reads or writes. Here’s how I handle this:
Adding Indexes Without Blocking
-- Bad: Acquires lock, blocks writes during index creation
CREATE INDEX idx_users_email ON users(email);
-- Good: Creates index concurrently, doesn't block writes
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);
-- Note: CREATE INDEX CONCURRENTLY cannot run inside a transaction block,
-- so wrapping it in a DO $$ ... $$ block will fail outright. Run it as a
-- standalone statement; IF NOT EXISTS makes it safe to re-run:
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_users_email ON users(email);

-- A concurrent build that fails partway leaves an INVALID index behind;
-- find it, drop it, and retry:
SELECT indexrelid::regclass FROM pg_index WHERE NOT indisvalid;
-- DROP INDEX CONCURRENTLY idx_users_email;
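Because the error handling has to live outside the database, I run concurrent builds from the migration tooling. A sketch using psycopg2 (the retry policy is illustrative; autocommit is required because CONCURRENTLY refuses to run inside a transaction):

import time
import psycopg2

def create_index_concurrently(dsn, index_name, table, column, retries=3):
    conn = psycopg2.connect(dsn)
    conn.autocommit = True  # CONCURRENTLY cannot run inside a transaction
    try:
        with conn.cursor() as cur:
            for attempt in range(retries):
                try:
                    cur.execute(
                        f"CREATE INDEX CONCURRENTLY IF NOT EXISTS {index_name} "
                        f"ON {table}({column})"
                    )
                    return
                except psycopg2.Error as exc:
                    # A failed build leaves an INVALID index; drop it and retry
                    print(f"attempt {attempt + 1} failed: {exc}")
                    cur.execute(f"DROP INDEX CONCURRENTLY IF EXISTS {index_name}")
                    time.sleep(30)
        raise RuntimeError(f"could not build {index_name} after {retries} attempts")
    finally:
        conn.close()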
For large tables (billions of rows), even CREATE INDEX CONCURRENTLY can take hours. I use a batched approach:
-- Create partial indexes first (fast)
CREATE INDEX CONCURRENTLY idx_users_email_recent
ON users(email)
WHERE created_at > '2024-01-01';
-- Then create the full index during lower traffic
CREATE INDEX CONCURRENTLY idx_users_email_full ON users(email);
-- Drop the partial index after full index is ready
DROP INDEX CONCURRENTLY idx_users_email_recent;
Changing Column Types Safely
Changing column types is notoriously lock-heavy. Here’s my pattern:
-- Changing users.age from INTEGER to BIGINT
-- Step 1: Add new column
ALTER TABLE users ADD COLUMN age_new BIGINT;
-- Step 2: Backfill in batches (short transactions, no long-held locks)
-- Note: COMMIT inside a DO block requires PostgreSQL 11+; on older
-- versions, drive the batches from an external script (sketch after Step 5)
DO $$
DECLARE
    batch_size INT := 10000;
    last_id BIGINT := 0;
    max_id BIGINT;
BEGIN
    SELECT COALESCE(MAX(id), 0) INTO max_id FROM users;
    -- Loop up to max_id rather than exiting on an empty batch, so id gaps
    -- larger than batch_size can't end the backfill early
    WHILE last_id < max_id LOOP
        UPDATE users
        SET age_new = age
        WHERE id > last_id
          AND id <= last_id + batch_size
          AND age_new IS NULL;
        last_id := last_id + batch_size;
        COMMIT;  -- release locks after each batch
        -- Brief pause to avoid overwhelming the database
        PERFORM pg_sleep(0.1);
    END LOOP;
END $$;
-- Step 3: Add constraint to ensure new column is always populated
ALTER TABLE users ADD CONSTRAINT age_new_not_null
CHECK (age_new IS NOT NULL) NOT VALID;
-- Validate the constraint (can be done during low traffic)
ALTER TABLE users VALIDATE CONSTRAINT age_new_not_null;
-- Step 4: Deploy application code to use age_new
-- Step 5: Drop old column and rename (each takes only a brief
-- ACCESS EXCLUSIVE lock; neither rewrites the table)
ALTER TABLE users DROP COLUMN age;
ALTER TABLE users RENAME COLUMN age_new TO age;
-- Optionally (PostgreSQL 12+): promote to a real NOT NULL, which the
-- validated CHECK constraint lets Postgres do without a table scan,
-- then drop the now-redundant constraint
ALTER TABLE users ALTER COLUMN age SET NOT NULL;
ALTER TABLE users DROP CONSTRAINT age_new_not_null;
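If you're on a version before PostgreSQL 11 (no COMMIT inside DO blocks), or you want per-batch pacing and metrics, drive the Step 2 backfill from outside the database. A minimal sketch with psycopg2 (pacing values are illustrative):

import time
import psycopg2

def backfill_age_new(dsn, batch_size=10000, pause_seconds=0.1):
    conn = psycopg2.connect(dsn)
    conn.autocommit = True  # each batch commits (and releases locks) on its own
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT COALESCE(MAX(id), 0) FROM users")
            max_id = cur.fetchone()[0]
            last_id = 0
            while last_id < max_id:
                cur.execute(
                    "UPDATE users SET age_new = age "
                    "WHERE id > %s AND id <= %s AND age_new IS NULL",
                    (last_id, last_id + batch_size),
                )
                last_id += batch_size
                time.sleep(pause_seconds)  # keep load and replication lag in check
    finally:
        conn.close()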
Automation and Safety Rails
Manual migrations are error-prone. Here’s my automation framework using Terraform and CI/CD:
# terraform/database-migration.tf
resource "null_resource" "database_migration" {
  triggers = {
    migration_version = var.migration_version
  }

  provisioner "local-exec" {
    command = <<-EOT
      # Pre-migration checks
      ./scripts/pre-migration-check.sh

      # Take snapshot (AWS RDS)
      aws rds create-db-snapshot \
        --db-instance-identifier ${var.db_instance} \
        --db-snapshot-identifier pre-migration-${var.migration_version}

      # Run migration with timeout and rollback on failure
      timeout 3600 ./scripts/run-migration.sh ${var.migration_version} || {
        echo "Migration failed, rolling back..."
        ./scripts/rollback-migration.sh ${var.migration_version}
        exit 1
      }

      # Post-migration verification
      ./scripts/verify-migration.sh
    EOT
  }
}
The pre-migration check script validates everything:
#!/bin/bash
# scripts/pre-migration-check.sh
set -euo pipefail
echo "Running pre-migration checks..."
# Check database connectivity
psql "$DATABASE_URL" -c "SELECT 1" > /dev/null || {
echo "ERROR: Cannot connect to database"
exit 1
}
# Check replication lag (if applicable)
REPLICATION_LAG=$(psql "$DATABASE_URL" -t -c "
SELECT EXTRACT(EPOCH FROM (NOW() - pg_last_xact_replay_timestamp()))::INT
")
if [ "$REPLICATION_LAG" -gt 60 ]; then
echo "ERROR: Replication lag is ${REPLICATION_LAG}s (max: 60s)"
exit 1
fi
# Check disk space (need at least 20% free)
DISK_USAGE=$(psql "$DATABASE_URL" -t -c "