12 min read
Dillon Browne

Zero-Downtime Database Migrations at Scale

A battle-tested playbook for executing database migrations on high-traffic production systems without downtime, featuring real-world patterns and automation.

Database Migration · DevOps · Zero Downtime · PostgreSQL · Cloud Architecture · Automation · Site Reliability · Infrastructure as Code

Database migrations are where DevOps theory meets harsh reality. You can have perfect CI/CD pipelines, immutable infrastructure, and blue-green deployments for your application layer—but when it’s time to migrate a production database serving millions of requests per second, all that elegance disappears. One wrong move and you’re looking at hours of downtime, data loss, or worse: silent data corruption that surfaces weeks later.

I’ve led database migrations ranging from simple schema changes on modest PostgreSQL instances to multi-petabyte data platform migrations across cloud providers. Each one taught me something new about what works, what fails spectacularly, and what “zero-downtime” actually means in production.

The Real Cost of Downtime

Before diving into patterns, let’s talk about why this matters. In my experience working with e-commerce, fintech, and SaaS platforms, I’ve seen the real costs:

  • Revenue loss: For a mid-sized e-commerce platform doing $50M/year, every hour of downtime costs roughly $5,700 in lost sales ($50M ÷ 8,760 hours in a year)
  • Customer trust: Users who encounter errors during checkout have a 60%+ abandonment rate and may never return
  • SLA penalties: Enterprise contracts often include financial penalties for downtime exceeding agreed thresholds
  • Team burnout: Emergency migrations that go sideways mean 3 AM war rooms and exhausted engineers

The stakes are high enough that “we’ll just do it during a maintenance window” isn’t acceptable for most modern systems.

The Expand-Contract Pattern: Your Foundation

The expand-contract pattern is the cornerstone of zero-downtime database migrations. I’ve used this pattern successfully on databases ranging from 100GB to 50TB+, and it works because it separates concerns into distinct, reversible phases.

Phase 1: Expand (Additive Changes)

First, you make purely additive changes to the database schema. No deletions, no modifications—only additions.

-- Example: Renaming a column from 'user_name' to 'username'
-- Phase 1: Add the new column
ALTER TABLE users ADD COLUMN username VARCHAR(255);

-- Create a trigger to keep both columns in sync
CREATE OR REPLACE FUNCTION sync_username()
RETURNS TRIGGER AS $$
BEGIN
  IF NEW.username IS NULL AND NEW.user_name IS NOT NULL THEN
    NEW.username := NEW.user_name;
  END IF;
  IF NEW.user_name IS NULL AND NEW.username IS NOT NULL THEN
    NEW.user_name := NEW.username;
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER sync_username_trigger
BEFORE INSERT OR UPDATE ON users
FOR EACH ROW EXECUTE FUNCTION sync_username();

-- Backfill existing data (in batches to avoid locks)
-- This runs async, doesn't block the migration
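
-- A minimal sketch of one backfill batch (hypothetical batch size; assumes an
-- integer primary key 'id'). Run it repeatedly from a script or worker until
-- it reports zero rows updated:
UPDATE users
SET username = user_name
WHERE id IN (
    SELECT id FROM users
    WHERE username IS NULL AND user_name IS NOT NULL
    ORDER BY id
    LIMIT 10000
);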

At this point, your application still uses the old column (user_name), but the new column (username) exists and stays synchronized via triggers.

Phase 2: Migrate Application Code

Deploy application changes that write to both columns but read from the new one:

# Old code
user = User(user_name="john_doe")

# Transitional code (deploy this first)
user = User(
    user_name="john_doe",  # Still writing to old column
    username="john_doe"     # Also writing to new column
)

# Application reads from new column
print(user.username)  # Not user.user_name

This phase is critical: your application is now using the new schema, but the old schema still works. You can roll back at any time by reverting the application deployment.
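
In practice I don't want every caller to remember to pass both fields, so the model (or repository) mirrors the value the same way the database trigger does. A toy sketch, not tied to any particular ORM (the class shape here is illustrative, not the real model):

class User:
    """Transitional model: keeps user_name and username in lockstep during the migration."""

    def __init__(self, user_name=None, username=None):
        # Whichever field the caller provides wins; mirror it into both
        # attributes so every write populates the old and the new column.
        value = username if username is not None else user_name
        self.user_name = value
        self.username = value

Once the contract phase lands, the mirroring (and the old attribute) disappears along with the old column.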

Phase 3: Contract (Remove Old Schema)

Only after the new code has been stable in production for days or weeks do you remove the old column:

-- Drop the trigger first
DROP TRIGGER IF EXISTS sync_username_trigger ON users;
DROP FUNCTION IF EXISTS sync_username();

-- Remove the old column
ALTER TABLE users DROP COLUMN user_name;

This three-phase approach gives you multiple rollback points and ensures you never have a moment where the database schema and application code are incompatible.

Handling Large-Scale Data Migrations

The expand-contract pattern works great for schema changes, but what about migrating 10TB of data from one database to another? Here’s the playbook I’ve used for cloud-to-cloud migrations.

The Dual-Write Pattern

For massive data migrations (like moving from AWS RDS to Google Cloud SQL), I use a dual-write strategy with eventual consistency:

┌─────────────────────────────────────────────────────┐
│                  Application Layer                  │
│                                                     │
│  ┌──────────────────────────────────────────────┐   │
│  │  Write Logic (Synchronous)                   │   │
│  │  1. Write to Primary DB (Source)             │   │
│  │  2. Write to Secondary DB (Target)           │   │
│  │  3. Queue event for async verification       │   │
│  └──────────────────────────────────────────────┘   │
│                                                     │
│  ┌──────────────────────────────────────────────┐   │
│  │  Read Logic                                  │   │
│  │  - Phase 1: Read from Source                 │   │
│  │  - Phase 2: Read from Target (after cutover) │   │
│  └──────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────┘
         │                                    │
         ▼                                    ▼
┌──────────────────┐              ┌──────────────────┐
│   Source DB      │              │   Target DB      │
│   (AWS RDS)      │              │  (GCP Cloud SQL) │
│                  │              │                  │
│  Historical Data │─────────────►│  Async Backfill  │
│                  │  DMS/Debezium│  + Verification  │
└──────────────────┘              └──────────────────┘

Implementation Strategy

Step 1: Initial Bulk Copy (Days to Weeks)

Use AWS Database Migration Service (DMS) or Debezium for initial replication:

# DMS Task Configuration (simplified)
migration_task:
  source: 
    engine: postgres
    endpoint: source-db.us-east-1.rds.amazonaws.com
  target:
    engine: postgres
    endpoint: target-db.us-central1.sql.gcp.com
  
  replication_config:
    full_load: true
    cdc_enabled: true  # Capture ongoing changes
    
  table_mappings:
    - schema: public
      tables: ["users", "orders", "products"]
      
  performance:
    parallel_threads: 8
    batch_size: 10000
    max_full_load_tasks: 4
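
While the full load and CDC run, I don't assume DMS is keeping up; I watch task progress and error counts. A rough monitoring sketch with boto3 (the task identifier, region, and output format here are placeholders, not values from the config above):

import boto3

dms = boto3.client("dms", region_name="us-east-1")

def report_migration_task(task_id: str) -> None:
    """Print status, full-load progress, and errored tables for a DMS replication task."""
    response = dms.describe_replication_tasks(
        Filters=[{"Name": "replication-task-id", "Values": [task_id]}]
    )
    for task in response["ReplicationTasks"]:
        stats = task.get("ReplicationTaskStats", {})
        print(
            f"{task['ReplicationTaskIdentifier']}: status={task['Status']}, "
            f"full_load={stats.get('FullLoadProgressPercent', 0)}%, "
            f"tables_errored={stats.get('TablesErrored', 0)}"
        )

report_migration_task("prod-postgres-migration")  # hypothetical task id

For ongoing CDC lag, the CDCLatencySource and CDCLatencyTarget CloudWatch metrics are the signals I alert on.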

Step 2: Enable Dual Writes

Modify your application to write to both databases:

from datetime import datetime

class DualWriteUserRepository:
    def __init__(self, primary_db, secondary_db, retry_queue, metrics):
        self.primary = primary_db
        self.secondary = secondary_db
        self.retry_queue = retry_queue  # async queue used to replay failed target writes
        self.metrics = metrics

    async def create_user(self, user_data):
        # Write to primary (source) - this must succeed
        try:
            primary_result = await self.primary.create(user_data)
        except Exception:
            self.metrics.increment('dual_write.primary.error')
            raise

        # Write to secondary (target) - failures are logged but don't block
        try:
            await self.secondary.create(user_data)
            self.metrics.increment('dual_write.secondary.success')
        except Exception:
            self.metrics.increment('dual_write.secondary.error')
            # Queue for retry, but don't fail the request
            await self.queue_retry(user_data)

        return primary_result

    async def queue_retry(self, data):
        """Queue failed secondary writes for async retry"""
        await self.retry_queue.publish({
            'operation': 'create_user',
            'data': data,
            'timestamp': datetime.utcnow().isoformat()
        })
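
The retry queue needs a consumer, otherwise "queue for retry" quietly becomes "drop the write". A minimal worker sketch; the consume/publish queue interface and the backoff numbers are assumptions, not a specific library:

import asyncio

class SecondaryWriteRetryWorker:
    """Replays failed secondary (target) writes until they succeed or attempts run out."""

    def __init__(self, retry_queue, secondary_db, metrics, max_attempts=5):
        self.retry_queue = retry_queue
        self.secondary = secondary_db
        self.metrics = metrics
        self.max_attempts = max_attempts

    async def run(self):
        while True:
            message = await self.retry_queue.consume()  # assumed queue API
            await self.process(message)

    async def process(self, message):
        data = message['data']
        for attempt in range(1, self.max_attempts + 1):
            try:
                await self.secondary.create(data)
                self.metrics.increment('dual_write.retry.success')
                return
            except Exception:
                self.metrics.increment('dual_write.retry.error')
                # Exponential backoff between attempts
                await asyncio.sleep(min(2 ** attempt, 60))
        # Out of attempts: surface for manual reconciliation (the verifier below also catches it)
        self.metrics.increment('dual_write.retry.exhausted')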

Step 3: Continuous Verification

This is where most teams fail. You need automated verification that both databases stay in sync:

# Verification worker (runs continuously)
import hashlib
import json

class DataConsistencyVerifier:
    def __init__(self, source_db, target_db, metrics):
        self.source_db = source_db
        self.target_db = target_db
        self.metrics = metrics

    async def verify_batch(self, table, offset, limit):
        """Compare a batch of records between source and target"""

        # Fetch from both databases (keyset pagination is better for huge tables,
        # but OFFSET keeps the example simple)
        source_records = await self.source_db.query(
            f"SELECT * FROM {table} ORDER BY id LIMIT {limit} OFFSET {offset}"
        )
        target_records = await self.target_db.query(
            f"SELECT * FROM {table} ORDER BY id LIMIT {limit} OFFSET {offset}"
        )

        # Compare checksums for efficiency
        source_checksum = self._compute_checksum(source_records)
        target_checksum = self._compute_checksum(target_records)

        if source_checksum != target_checksum:
            # Detailed comparison to find discrepancies
            # (_find_differences does a row-by-row diff keyed on id; elided here)
            differences = self._find_differences(source_records, target_records)

            for diff in differences:
                await self.handle_inconsistency(diff)
                self.metrics.increment(f'verification.{table}.inconsistent')
        else:
            self.metrics.increment(f'verification.{table}.consistent')

        return len(source_records)

    def _compute_checksum(self, records):
        """Order-stable hash of a batch of records"""
        payload = json.dumps(records, sort_keys=True, default=str)
        return hashlib.sha256(payload.encode()).hexdigest()

    async def handle_inconsistency(self, diff):
        """Fix inconsistencies automatically or alert"""
        if diff['type'] == 'missing_in_target':
            # Auto-repair: copy from source to target
            await self.target_db.upsert(diff['record'])
        elif diff['type'] == 'data_mismatch':
            # Alert for manual review
            await self.alert(f"Data mismatch for ID {diff['id']}")
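
Something still has to sweep every table batch by batch. A hypothetical driver loop (the table list, batch size, and pause are placeholders), relying on verify_batch returning how many source rows it compared:

import asyncio

TABLES = ['users', 'orders', 'products']
BATCH_SIZE = 5000
SWEEP_PAUSE_SECONDS = 300  # rest between full sweeps

async def run_verification_sweeps(verifier):
    """Continuously sweep each table; a table's sweep ends when a batch comes back empty."""
    while True:
        for table in TABLES:
            offset = 0
            while True:
                compared = await verifier.verify_batch(table, offset, BATCH_SIZE)
                if compared == 0:
                    break
                offset += BATCH_SIZE
        await asyncio.sleep(SWEEP_PAUSE_SECONDS)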

Step 4: The Cutover

After weeks of dual writes with verified consistency:

# Feature flag based cutover
class UserRepository:
    def __init__(self, source_db, target_db, feature_flags):
        self.source = source_db
        self.target = target_db
        self.flags = feature_flags
        
    async def get_user(self, user_id):
        # Gradual rollout: 1% -> 10% -> 50% -> 100%
        if self.flags.is_enabled('read_from_target', user_id):
            return await self.target.get(user_id)
        else:
            return await self.source.get(user_id)

This allows you to test the target database with production traffic while maintaining the ability to instantly roll back by toggling a feature flag.
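
The flag implementation matters here: a given user should stay on the same side of the split so they don't flip between databases mid-session. A minimal sketch of sticky, percentage-based bucketing (this class is an illustration, not a specific feature-flag library):

import hashlib

class PercentageFeatureFlags:
    """Sticky percentage rollout: the same user_id always lands in the same bucket."""

    def __init__(self, rollout_percentages):
        # e.g. {'read_from_target': 10} -> 10% of users read from the target DB
        self.rollout_percentages = rollout_percentages

    def is_enabled(self, flag_name: str, user_id) -> bool:
        percentage = self.rollout_percentages.get(flag_name, 0)
        digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
        bucket = int(digest[:8], 16) % 100  # deterministic 0-99 bucket
        return bucket < percentage

Bumping the percentage (1 → 10 → 50 → 100) is then a config change, and rolling back is dropping it to 0.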

Schema Migrations with Minimal Locking

One of the biggest challenges with PostgreSQL (and MySQL) is that many DDL operations acquire locks that block reads or writes. Here’s how I handle this:

Adding Indexes Without Blocking

-- Bad: Acquires lock, blocks writes during index creation
CREATE INDEX idx_users_email ON users(email);

-- Good: Creates index concurrently, doesn't block writes
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);

-- Note: CREATE INDEX CONCURRENTLY cannot run inside a transaction block, so it
-- can't be wrapped in a DO block or a transactional migration step. Run it as a
-- standalone statement; IF NOT EXISTS keeps retries idempotent:
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_users_email ON users(email);

-- If a concurrent build fails partway, it leaves an INVALID index behind.
-- Drop it before retrying:
-- DROP INDEX CONCURRENTLY IF EXISTS idx_users_email;

For large tables (billions of rows), even CREATE INDEX CONCURRENTLY can take hours. I use a staged approach:

-- Create partial indexes first (fast)
CREATE INDEX CONCURRENTLY idx_users_email_recent 
ON users(email) 
WHERE created_at > '2024-01-01';

-- Then create the full index during lower traffic
CREATE INDEX CONCURRENTLY idx_users_email_full ON users(email);

-- Drop the partial index after full index is ready
DROP INDEX CONCURRENTLY idx_users_email_recent;

Changing Column Types Safely

Changing column types is notoriously lock-heavy. Here’s my pattern:

-- Changing users.age from INTEGER to BIGINT

-- Step 1: Add new column
ALTER TABLE users ADD COLUMN age_new BIGINT;

-- Step 2: Backfill in batches (short transactions, no long-held locks)
-- Note: before Step 3, make sure new writes also populate age_new, either via a
-- sync trigger like the username example above or via an app deploy that writes
-- both columns. A NOT VALID constraint only skips checking existing rows; new
-- writes are still enforced.
DO $$
DECLARE
    batch_size INT := 10000;
    last_id BIGINT := 0;
    max_id BIGINT;
BEGIN
    SELECT COALESCE(MAX(id), 0) INTO max_id FROM users;

    WHILE last_id < max_id LOOP
        UPDATE users
        SET age_new = age
        WHERE id > last_id
          AND id <= last_id + batch_size
          AND age_new IS NULL;

        last_id := last_id + batch_size;

        -- Commit each batch so row locks are released and autovacuum can keep up
        -- (allowed in PG 11+ when the DO block isn't inside an explicit transaction)
        COMMIT;

        -- Brief pause to avoid overwhelming the database
        PERFORM pg_sleep(0.1);
    END LOOP;
END $$;

-- Step 3: Add constraint to ensure new column is always populated
ALTER TABLE users ADD CONSTRAINT age_new_not_null
CHECK (age_new IS NOT NULL) NOT VALID;

-- Validate the constraint (can be done during low traffic)
ALTER TABLE users VALIDATE CONSTRAINT age_new_not_null;

-- Step 4: Deploy application code to use age_new

-- Step 5: Drop old column and rename (each takes a brief ACCESS EXCLUSIVE lock;
-- a short lock_timeout stops a blocked ALTER from queueing behind long-running
-- queries and stalling all traffic). Coordinate the rename with a final app
-- deploy that switches back to reading and writing 'age'.
SET lock_timeout = '5s';
ALTER TABLE users DROP COLUMN age;
ALTER TABLE users RENAME COLUMN age_new TO age;

Automation and Safety Rails

Manual migrations are error-prone. Here’s my automation framework using Terraform and CI/CD:

# terraform/database-migration.tf

resource "null_resource" "database_migration" {
  triggers = {
    migration_version = var.migration_version
  }
  
  provisioner "local-exec" {
    command = <<-EOT
      # Pre-migration checks
      ./scripts/pre-migration-check.sh
      
      # Take snapshot (AWS RDS)
      aws rds create-db-snapshot \
        --db-instance-identifier ${var.db_instance} \
        --db-snapshot-identifier pre-migration-${var.migration_version}
      
      # Run migration with timeout and rollback on failure
      timeout 3600 ./scripts/run-migration.sh ${var.migration_version} || {
        echo "Migration failed, rolling back..."
        ./scripts/rollback-migration.sh ${var.migration_version}
        exit 1
      }
      
      # Post-migration verification
      ./scripts/verify-migration.sh
    EOT
  }
}

The pre-migration check script validates everything:

#!/bin/bash
# scripts/pre-migration-check.sh

set -euo pipefail

echo "Running pre-migration checks..."

# Check database connectivity
psql "$DATABASE_URL" -c "SELECT 1" > /dev/null || {
    echo "ERROR: Cannot connect to database"
    exit 1
}

# Check replication lag (if applicable)
# (pg_last_xact_replay_timestamp() is NULL on a primary, so COALESCE to 0 there;
#  -A strips psql's padding so the value is a bare integer)
REPLICATION_LAG=$(psql "$DATABASE_URL" -t -A -c "
    SELECT COALESCE(EXTRACT(EPOCH FROM (NOW() - pg_last_xact_replay_timestamp()))::INT, 0)
")

if [ "$REPLICATION_LAG" -gt 60 ]; then
    echo "ERROR: Replication lag is ${REPLICATION_LAG}s (max: 60s)"
    exit 1
fi

# Check disk space (need at least 20% free)
DISK_USAGE=$(psql "$DATABASE_URL" -t -c "
