Every deployment is a risk. The old version stops, the new version starts, and somewhere in between, users might encounter errors. Zero-downtime deployment eliminates that gap. This guide covers the patterns, practices, and code needed to deploy without any user-visible interruption — from health checks and graceful shutdown to the hardest problem in deployments: database migrations that don't break your running application.
What Is Zero-Downtime Deployment?
Zero-downtime deployment is a release strategy where a new version of an application replaces the old version without any period of unavailability. Users experience no interruption during the transition. In-flight requests complete normally. New requests reach the new version once it is ready.
This is distinct from a traditional "stop-start" deployment, where the running process is terminated, the new version is deployed, and the application is restarted. During that window — even if it is only a few seconds — requests fail with 503 or 502 errors.
Why it matters:
- User trust. A 502 error during a routine update erodes confidence. Users do not know whether a deployment is in progress or the application has broken entirely.
- Search engine crawlers. Googlebot and other crawlers that hit 502 errors during a deployment may treat them as signs of an unreliable site. Frequent crawl errors during deployment windows can affect indexing.
- Business continuity. For applications with SLAs, any downtime counts against availability targets. Zero-downtime deployment removes an entire category of availability incident.
- Developer confidence. When deployments are safe, teams deploy more frequently. Frequent deployments mean smaller changesets, which means less risk per deployment and faster iteration.
The cost of downtime scales with traffic. At 10,000 requests per minute, even a 30-second restart window means 5,000 failed requests. At higher volumes, a brief restart becomes a significant incident.
Deployment Strategies
Three main strategies implement zero-downtime deployment, each with different trade-offs in complexity, resource usage, and rollback capability.
Rolling Deployment
In a rolling deployment, new instances of your application start alongside the existing ones. Once a new instance passes its health check and is ready to serve traffic, the load balancer adds it to the rotation. Old instances are then removed one at a time as new instances come online to replace them.
This approach requires no additional standing infrastructure, only brief headroom for the overlap. If you run two instances of your application, a rolling deployment briefly runs three: the two original instances plus the first new instance, before any old instance is removed.
Rolling deployment is the default behavior on Out Plane. When you push to your connected branch, the platform builds the new version, starts new instances, waits for health checks to pass, shifts traffic, and terminates the old instances in sequence.
The primary constraint of rolling deployment is that during the transition, both versions of your application run simultaneously. This means your old and new code must be able to coexist — particularly at the database layer. We cover this in detail in the database migrations section.
Blue-Green Deployment
Blue-green deployment maintains two identical production environments. At any given time, one environment (blue) is live and serving traffic. Deployments go to the idle environment (green). When the new version is ready and tested, traffic switches from blue to green at the load balancer level.
The key advantages are instant rollback and a clean transition: at no point are two different versions serving traffic simultaneously. If something is wrong with the green deployment, switching back to blue takes seconds.
The drawback is resource cost: you maintain two full production environments permanently, even though only one is active at any time. For large applications with significant infrastructure costs, this doubles the baseline expense.
Blue-green is well-suited for applications where simultaneous version coexistence is problematic — for example, applications with complex session state that cannot be shared across versions.
Canary Deployment
A canary deployment routes a small percentage of production traffic to the new version while the majority continues to reach the stable version. The new version is monitored for error rates, latency, and business metrics. If it behaves correctly, traffic is gradually increased until the new version handles 100% of requests and the old version is retired.
Canary deployments are valuable for high-risk changes where you want early signal from real traffic before full rollout. A 1% canary exposes you to real production conditions — edge cases, unusual request patterns, regional differences — without putting your full user base at risk.
The complexity cost is significant: you need traffic-splitting infrastructure, version-aware metrics, and a defined promotion process. For most teams, rolling deployment covers the zero-downtime requirement without this overhead.
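As an illustration of the idea (real platforms make this decision at the load balancer, not in application code), the routing choice for a percentage-based canary can be sketched in a few lines:

```javascript
// Toy traffic splitter: send `percent`% of requests to the canary pool.
// `rand` is injectable so the behavior can be tested deterministically.
function pickBackend(percent, rand = Math.random) {
  return rand() * 100 < percent ? 'canary' : 'stable';
}

// Example: a 1% canary
const backend = pickBackend(1);
```

Promotion then means raising `percent` in stages while watching version-tagged error and latency metrics.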
Health Checks: The Foundation
Health checks are the mechanism that makes zero-downtime deployment safe. Without them, the platform has no way to know whether a new instance is ready to serve traffic before sending requests to it.
There are two distinct types of health checks, and understanding the difference matters.
Liveness checks answer the question: is this process alive? A liveness failure means the container should be restarted. A typical liveness check returns 200 if the process is running and responsive.
Readiness checks answer the question: is this process ready to accept traffic? A readiness failure means the instance should not receive requests yet, but it should not be killed. An instance might be alive (process is running) but not ready (still warming up a cache, establishing database connections, or loading configuration).
In zero-downtime deployment, the readiness check is critical. New instances should not receive traffic until they pass the readiness check. Old instances should continue receiving traffic until enough new instances are ready to replace them.
Node.js / Express
```javascript
const express = require('express');
const { Pool } = require('pg');

const app = express();
const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Liveness check — is the process alive?
app.get('/health/live', (req, res) => {
  res.status(200).json({ status: 'alive' });
});

// Readiness check — is the application ready to serve traffic?
app.get('/health/ready', async (req, res) => {
  try {
    await pool.query('SELECT 1');
    res.status(200).json({ status: 'ready' });
  } catch (err) {
    res.status(503).json({ status: 'not ready', reason: 'database unavailable' });
  }
});
```

A single /health endpoint that checks database connectivity is a reasonable starting point. The distinction between liveness and readiness becomes more important as your application grows more complex.
Python / FastAPI
```python
from fastapi import FastAPI, HTTPException
from sqlalchemy import text

from database import engine  # your SQLAlchemy engine

app = FastAPI()

@app.get("/health/live")
async def liveness():
    return {"status": "alive"}

@app.get("/health/ready")
async def readiness():
    try:
        with engine.connect() as conn:
            conn.execute(text("SELECT 1"))
        return {"status": "ready"}
    except Exception:
        raise HTTPException(
            status_code=503,
            detail={"status": "not ready", "reason": "database unavailable"},
        )
```

Go / net/http
```go
package main

import (
	"database/sql"
	"encoding/json"
	"net/http"
)

func livenessHandler(db *sql.DB) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(http.StatusOK)
		json.NewEncoder(w).Encode(map[string]string{"status": "alive"})
	}
}

func readinessHandler(db *sql.DB) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		if err := db.Ping(); err != nil {
			w.WriteHeader(http.StatusServiceUnavailable)
			json.NewEncoder(w).Encode(map[string]string{
				"status": "not ready",
				"reason": "database unavailable",
			})
			return
		}
		w.WriteHeader(http.StatusOK)
		json.NewEncoder(w).Encode(map[string]string{"status": "ready"})
	}
}
```

Keep health check handlers fast. They should complete in well under 100ms. If your health check takes two seconds to respond because it is querying a slow external dependency, you will create problems under high-frequency polling: delayed deployments and spurious readiness failures.
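One way to keep a dependency probe bounded is to race the check against a timer. This `withTimeout` helper is a generic sketch, not tied to any particular framework:

```javascript
// Bound any promise-returning check (e.g. () => pool.query('SELECT 1')) so a
// slow dependency fails the readiness probe quickly instead of stalling it.
function withTimeout(check, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`check timed out after ${ms}ms`)), ms);
  });
  return Promise.race([check(), timeout]).finally(() => clearTimeout(timer));
}
```

The readiness handler can then return 503 on either a failed or a timed-out check, so the probe itself stays fast even when a dependency is slow.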
Graceful Shutdown
Graceful shutdown is the other half of zero-downtime deployment. When the platform sends a termination signal to an instance that is being replaced, the instance must:
- Stop accepting new connections
- Allow in-flight requests to complete
- Close database connections cleanly
- Exit
If your application ignores the termination signal and is forcibly killed, any requests currently being processed will fail. Those are user-visible errors, even if only for a few seconds.
Node.js Graceful Shutdown
```javascript
const server = app.listen(PORT, () => {
  console.log(`Server listening on port ${PORT}`);
});

process.on('SIGTERM', () => {
  console.log('SIGTERM received, shutting down gracefully');

  // Stop accepting new connections
  server.close(async () => {
    console.log('HTTP server closed');

    // Close database connection pool
    await pool.end();
    console.log('Database pool closed');
    process.exit(0);
  });

  // Force exit if graceful shutdown takes too long
  setTimeout(() => {
    console.error('Graceful shutdown timed out, forcing exit');
    process.exit(1);
  }, 30000);
});
```

The timeout is important. If a request is stuck — for example, a long-running database query — you do not want the instance to hang indefinitely. Set the timeout to something slightly higher than your maximum expected request duration.
Python / FastAPI Graceful Shutdown
FastAPI applications running under uvicorn handle SIGTERM gracefully by default, completing in-flight requests before exiting. You can add application-level cleanup with lifespan events:
```python
from contextlib import asynccontextmanager

from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup: initialize resources
    await database.connect()
    yield
    # Shutdown: clean up resources
    await database.disconnect()

app = FastAPI(lifespan=lifespan)
```

Go Graceful Shutdown
```go
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	mux := http.NewServeMux() // register your application's routes on this mux
	srv := &http.Server{Addr: ":8080", Handler: mux}

	// Start server in a goroutine
	go func() {
		if err := srv.ListenAndServe(); err != http.ErrServerClosed {
			log.Fatalf("HTTP server error: %v", err)
		}
	}()

	// Wait for termination signal
	quit := make(chan os.Signal, 1)
	signal.Notify(quit, syscall.SIGTERM, syscall.SIGINT)
	<-quit
	log.Println("Shutting down server...")

	// Allow 30 seconds for in-flight requests to complete
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Fatalf("Server forced to shutdown: %v", err)
	}
	log.Println("Server exited cleanly")
}
```

Database Migrations Without Downtime
This is the hardest part of zero-downtime deployment. Application code is stateless and can be replaced at any time. The database is stateful, shared between old and new instances simultaneously during a rolling deployment, and schema changes can break one or both versions.
The Core Rule: Never Break the Running Version
During a rolling deployment, old and new application instances run simultaneously. Any database migration you run during this window must work correctly with both versions of the application code.
This rules out a large class of migration operations that seem straightforward: renaming a column, changing a column type, removing a column, or adding a required constraint. All of these can break the currently running version of your application.
The Two-Phase Migration Pattern
The solution is to split breaking changes across multiple deployments. This is sometimes called expand-contract migration or the two-phase migration pattern: expand the schema with new structure first, and contract it by removing old structure only after the new code is fully rolled out. In practice it breaks down into four steps.
Phase 1: Expand (backward-compatible change)
Deploy a migration that adds new structure without removing anything. The existing version of the application continues to work because nothing it depends on has changed.
Phase 2: Migrate
If data needs to move between old and new structures, run that migration as a separate step. For large tables, this may need to be done in batches to avoid locking.
Phase 3: Switch (deploy new code)
Deploy the new version of the application that uses the new structure.
Phase 4: Contract (cleanup)
Once the new version is fully deployed and the old version is no longer running, remove the old structure.
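The batched data migration in phase 2 can be sketched as a small script. Here runQuery stands in for your database client (for example, pg's pool.query), and the table and column names are hypothetical:

```javascript
// Backfill full_name from legacy_name 1,000 rows at a time, stopping when a
// batch updates zero rows. Small batches keep row locks short-lived.
async function backfillInBatches(runQuery, batchSize = 1000) {
  let batches = 0;
  let updated;
  do {
    const result = await runQuery(
      `UPDATE users SET full_name = legacy_name
        WHERE id IN (SELECT id FROM users
                      WHERE full_name IS NULL
                      LIMIT ${batchSize})`
    );
    updated = result.rowCount; // zero rows touched means the backfill is done
    batches += 1;
  } while (updated > 0);
  return batches;
}
```

Each batch is its own short transaction, so the backfill never holds a long lock the way a single table-wide UPDATE would.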
Safe Migration Patterns
Adding a column: Safe. Use a default value or allow NULL. The old application ignores the new column; the new application reads and writes it.
```sql
-- Safe: additive change, old code is unaffected
ALTER TABLE users ADD COLUMN preferences jsonb DEFAULT '{}';
```

Removing a column: Requires two deployments. First deploy application code that no longer references the column. Then drop the column in a subsequent migration after the old version is fully replaced.
Renaming a column: Five steps.
- Add the new column.
- Deploy application code that writes to both old and new columns, reads from new.
- Backfill historical data from old to new.
- Deploy application code that only uses the new column.
- Drop the old column.
Changing a column type: Follow the same pattern as renaming — add a new column of the correct type alongside the old one, migrate data, switch application code, then drop the old column.
Adding a NOT NULL constraint: First add the column as nullable and backfill existing rows with appropriate values. Then add the constraint in a subsequent migration.
```sql
-- Step 1: Add nullable column
ALTER TABLE orders ADD COLUMN status varchar(20);

-- Step 2: Backfill existing rows (run as separate migration)
UPDATE orders SET status = 'completed' WHERE status IS NULL;

-- Step 3: Add constraint after backfill and new code is deployed
ALTER TABLE orders ALTER COLUMN status SET NOT NULL;
```

Dangerous Migration Patterns
These patterns break running application instances and must not be applied during a live rolling deployment.
DROP COLUMN while old code still references it. Old instances will fail on any query that selects or writes to the dropped column.
ALTER COLUMN TYPE on a large table. PostgreSQL rewrites the entire table, holding an exclusive lock for the duration. Depending on table size, this can lock your application for minutes or longer.
Adding NOT NULL without a default or backfill. Any INSERT from the old version that does not include the new column will fail with a constraint violation.
Creating a unique index without CONCURRENTLY. A standard CREATE UNIQUE INDEX locks the table. Use CREATE UNIQUE INDEX CONCURRENTLY to build the index without blocking.
```sql
-- Dangerous: locks the table
CREATE UNIQUE INDEX idx_users_email ON users(email);

-- Safe: builds concurrently without blocking writes
CREATE UNIQUE INDEX CONCURRENTLY idx_users_email ON users(email);
```

Note that CREATE INDEX CONCURRENTLY cannot run inside a transaction block, so it may need its own migration step in frameworks that wrap each migration in a transaction. For a deeper guide to PostgreSQL production operations, see our PostgreSQL production guide.
Handling Long-Running Requests
Standard HTTP requests complete in milliseconds to seconds and present no real challenge for graceful shutdown. Longer operations require specific handling.
WebSocket connections. Persistent connections will be terminated when an instance shuts down. Clients must reconnect automatically. Implement reconnection logic with exponential backoff on the client side so that reconnection happens without user intervention.
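The reconnect delay itself is straightforward to compute. This sketch caps the exponential growth and adds jitter so a fleet of clients does not reconnect in lockstep; rand is injectable for testing:

```javascript
// Delay before reconnect attempt N: exponential growth, capped at maxMs,
// jittered to 50-100% of the capped value.
function backoffDelay(attempt, baseMs = 500, maxMs = 30000, rand = Math.random) {
  const capped = Math.min(maxMs, baseMs * 2 ** attempt);
  return capped / 2 + rand() * (capped / 2);
}
```

The client resets attempt to zero after a successful reconnect, so a deployment costs each client at most one short delay.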
File uploads. Large file uploads in progress will be interrupted when the instance shuts down. Configure your shutdown timeout to be longer than your maximum expected upload duration, or use a pre-signed URL pattern where uploads go directly to object storage rather than through your application server.
Background jobs. If your application processes background jobs inline (rather than through a dedicated job queue), jobs in progress during shutdown will be terminated. The safer pattern is to use a job queue (such as BullMQ for Node.js or Celery for Python) that checkpoints job progress and allows jobs to be retried by another instance.
The shutdown timeout. Set your graceful shutdown timeout to be slightly higher than your p99 request latency. If your 99th percentile request completes in 5 seconds, a 30-second timeout gives significant headroom. The forced exit timeout prevents instances from hanging indefinitely if a request is genuinely stuck.
Application-Level Patterns
Zero-downtime deployment is not only an infrastructure concern. Several application-level patterns support the goal of safe, continuous deployment.
Backward-Compatible APIs
When you change an API, adding new fields to a response is safe — clients that do not know about the new field ignore it. Removing fields is not safe — clients that depend on the removed field will break.
If you need to make a breaking API change, version your endpoints (/v2/users) rather than modifying the existing endpoint. Deprecate the old version with sufficient notice, monitor actual usage before removing it, and then remove it only when usage reaches zero.
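The additive rule can be made concrete with a hypothetical response serializer, where a new version only ever adds fields:

```javascript
// Hypothetical user serializer: version 2 adds avatarUrl but never removes
// or renames a field, so v1 clients keep working against either shape.
function serializeUser(user, version = 1) {
  const body = { id: user.id, name: user.name };
  if (version >= 2) {
    body.avatarUrl = user.avatarUrl ?? null; // additive only
  }
  return body;
}
```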
Feature Flags
Feature flags decouple deployment from feature release. You deploy code with a new feature disabled by default, then enable it through configuration without a redeployment.
This pattern has several advantages: if a feature causes problems, you disable the flag rather than rolling back an entire deployment. You can enable features for internal users before broad release. You can test changes with a small percentage of users before enabling them universally.
Feature flags can be implemented with a simple database table or environment variable for low-complexity cases, or with a dedicated feature flag service for more sophisticated rollout control.
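The low-complexity version is just an environment-variable check. In this sketch, FEATURE_NEW_CHECKOUT is a hypothetical flag name, and the environment is injectable so the function can be tested:

```javascript
// Read a boolean feature flag from the environment. Reading at call time
// (rather than caching at startup) lets a config change take effect without
// a code deployment, if your platform applies env updates to running instances.
function isEnabled(name, env = process.env) {
  return env[`FEATURE_${name}`] === 'true';
}

// Usage: if (isEnabled('NEW_CHECKOUT')) { /* new code path */ }
```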
Static Asset Versioning
During a rolling deployment, some users will receive HTML from the new version of your application while their browser has cached JavaScript and CSS from the old version. If your asset filenames are the same across deployments, users may execute old JavaScript against a page structure that has changed.
Content-hashed filenames solve this. When your build system generates main.a3f2c1.js instead of main.js, old and new assets can coexist safely. The CDN and browser cache serve the correct version based on the filename in the HTML. Next.js and most modern build tools do this by default.
Zero-Downtime Deployment on Out Plane
Out Plane implements rolling deployments automatically. You do not configure the deployment strategy — it is the default behavior for every deployment.
When you push to your connected branch, the platform:
- Builds the new version from your source code
- Starts new instances alongside the existing ones
- Waits for the new instances to pass the health check
- Begins routing traffic to the new instances
- Terminates old instances after they complete in-flight requests
For zero-downtime to be guaranteed, two conditions must be met on your side:
Health check endpoint. Configure your application to expose a health check endpoint that returns 200 when ready to serve traffic. Without a health check, the platform uses a startup delay heuristic, which is less reliable. See the health checks section for implementation examples.
Minimum instances. Set minimum instances to 2 or higher in production. With a single instance, there is a brief transition period while the new instance comes up and the old one is still the only available instance. With two or more instances, at least one is always running while the new version starts.
If the new version's health check fails, Out Plane does not route traffic to it. The old version continues running, and the failed deployment does not cause downtime. This automatic rollback on health check failure is a safety net for deployments where the new version has a startup error.
You can connect your GitHub repository and configure auto-deploy in console.outplane.com. For details on the auto-deploy setup, see our guide on auto-deploying with GitHub.
Testing Your Zero-Downtime Setup
Configuring zero-downtime deployment is not enough — you need to verify it works before you depend on it in production.
Load test during a deployment. Use a tool like k6, wrk, or hey to generate continuous traffic against your application, then trigger a deployment. Monitor the error rate. A correctly configured zero-downtime deployment should produce zero 5xx errors during the transition.
```shell
# Run a load test with k6 during a deployment
k6 run --vus 50 --duration 120s script.js
```

Monitor error rates during the deploy window. Your metrics should show zero increase in 5xx responses during deployment. A spike in 5xx errors during deployment indicates a gap in your health check configuration or graceful shutdown handling.
Check for dropped connections. Some load testing tools report connection errors separately from HTTP errors. Verify both are zero during deployment.
Test database migration safety in staging. Before applying a migration to production, run it against a staging environment that has a copy of production data. Measure how long the migration takes. If a migration on the staging copy of production data takes 30 seconds, it will likely take a similar amount of time in production — and any locking implications apply equally.
Add deployment tests to your CI pipeline. For high-traffic applications, automate the deployment load test in CI. Run it on every deploy and fail the deployment if the error rate exceeds a threshold. This prevents regressions in deployment safety as your application evolves.
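The CI gate can be a short script over the exported results. This sketch assumes k6's --summary-export JSON, where rate metrics such as http_req_failed carry a value field between 0 and 1:

```javascript
// Return true only if the load test recorded an acceptable request failure
// rate. Missing data is treated as unsafe rather than silently passing.
function deploymentIsSafe(summary, maxErrorRate = 0) {
  const failed = summary && summary.metrics && summary.metrics.http_req_failed;
  if (!failed || typeof failed.value !== 'number') return false;
  return failed.value <= maxErrorRate;
}
```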
For context on scaling patterns that complement zero-downtime deployment, see our guide on horizontal scaling for Node.js applications.
Deployment Checklist
Use this checklist before deploying to a production environment that requires zero-downtime deployment.
- Health check endpoint implemented and returns 200 only when fully ready
- Health check verifies critical dependencies (database, external services)
- Graceful shutdown handles SIGTERM and completes in-flight requests
- Shutdown timeout configured slightly above p99 request latency
- Database connections closed cleanly on shutdown
- Database migration is backward-compatible with the currently running version
- No DROP COLUMN, RENAME COLUMN, or locking ALTER TABLE in the migration
- Static assets use content-hashed filenames
- Minimum instances set to 2 or higher in production
- API changes are additive (new fields only, no removals)
- Rollback procedure documented and tested
- Load test during deployment verified zero errors in staging
Summary
Zero-downtime deployment is achievable for any web application. The foundation is two things: a readiness health check that tells the platform when a new instance is ready, and graceful shutdown that allows in-flight requests to complete before the old instance exits.
Database migrations are the hardest part. The two-phase migration pattern — expand first, then contract — lets you make any schema change safely across a rolling deployment. The rule is simple: never apply a migration that breaks the currently running version of your application.
The application-level patterns (backward-compatible APIs, feature flags, content-hashed assets) reduce the risk of each deployment incrementally. Teams that practice zero-downtime deployment consistently tend to deploy more frequently, with smaller changes, which further reduces the risk of any individual release.
Ready to run zero-downtime deployments in production? Sign up for Out Plane and get $20 in free credit to get started.