Step 1: Bootstrap a baseline docker-compose stack with app and reverse proxy

What we're doing this step

Zero-downtime rollouts on a single host are not magic. They lean on two simple facts: traffic enters through one stable address, and the workload behind that address can be replaced one replica at a time. So before we touch anything that looks like a rolling update, we need a baseline stack where those two roles already exist as separate services. That is what this step builds — an HTTP app container plus an Nginx reverse proxy container, wired together with a user-defined Docker network, and pinned in place with a handful of structural tests that fail loudly the moment we drift from the shape later steps assume.

We are deliberately starting boring. The app is a stdlib http.server that answers /health with ok and / with a JSON payload identifying which replica served the request. The proxy is plain Nginx with one upstream block. No frameworks. No databases. The point is to keep the moving parts small enough that when we later add docker rollout, healthchecks, connection draining, and a swap of the upstream entries, every change is obvious in a diff.

Setup

Create the following layout under codebase/:

codebase/
├── docker-compose.yml
├── app/
│   ├── Dockerfile
│   └── server.py
├── nginx/
│   └── nginx.conf
└── tests/
    ├── test_app_server.py
    ├── test_compose_config.py
    └── test_nginx_config.py

Dev dependencies for the test harness go into pyproject.toml:

[dependency-groups]
dev = [
    "pytest>=8.0",
    "pyyaml>=6.0",
]

pytest and pyyaml are the only non-stdlib pieces, and both are dev-only. pyyaml is there to parse docker-compose.yml from the test suite so we can assert structural invariants without spinning up Docker.

Implementation

The app container

app/server.py is a 60-line HTTP service. Two routes, no framework:

def container_id() -> str:
    explicit = os.environ.get("APP_REPLICA_ID")
    if explicit:
        return explicit
    return socket.gethostname()


class AppHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path == "/health":
            self._write(200, b"ok", "text/plain; charset=utf-8")
            return
        if self.path == "/":
            payload = {"replica": container_id(), "version": app_version()}
            body = json.dumps(payload).encode("utf-8")
            self._write(200, body, "application/json")
            return
        self._write(404, b"not found", "text/plain; charset=utf-8")

Two design choices worth flagging:

  1. /health is a separate route that returns the literal string ok. Later steps add a proper readiness gate, but every rollout strategy needs a cheap probe that can be hit thousands of times per minute by both the proxy and the rollout controller. Keeping it plain text means the handler does no serialization work and the caller can verify the response with a byte-for-byte comparison.
  2. The root route reports replica and version read from APP_REPLICA_ID and APP_VERSION. When we eventually have two replicas serving traffic side by side, we can curl / in a loop and watch the response shift from one replica id to the other. That is our zero-downtime evidence.
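The excerpt above leaves out its imports and two helpers, so here is a minimal self-contained sketch of how the pieces could fit together. The bodies of app_version() and _write(), the "dev" default version, and the fetch_once() demo helper are assumptions for illustration, not the repository's code.

```python
import json
import os
import socket
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def container_id() -> str:
    # Prefer an explicit replica id; otherwise fall back to the hostname.
    return os.environ.get("APP_REPLICA_ID") or socket.gethostname()


def app_version() -> str:
    # Assumption: the version is read from APP_VERSION with a placeholder default.
    return os.environ.get("APP_VERSION", "dev")


class AppHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:  # noqa: N802 (stdlib API)
        if self.path == "/health":
            self._write(200, b"ok", "text/plain; charset=utf-8")
        elif self.path == "/":
            payload = {"replica": container_id(), "version": app_version()}
            self._write(200, json.dumps(payload).encode("utf-8"), "application/json")
        else:
            self._write(404, b"not found", "text/plain; charset=utf-8")

    def _write(self, status: int, body: bytes, content_type: str) -> None:
        # Assumed helper: status line, two headers, then the body.
        self.send_response(status)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format: str, *args: object) -> None:
        # Keep the demo quiet; the real service may well log requests.
        pass


def fetch_once(path: str) -> tuple[int, bytes]:
    """Serve one request on an ephemeral loopback port; return (status, body)."""
    server = HTTPServer(("127.0.0.1", 0), AppHandler)
    worker = threading.Thread(target=server.handle_request, daemon=True)
    worker.start()
    # Ignore any proxy configured in the environment; this is a loopback call.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        url = f"http://127.0.0.1:{server.server_port}{path}"
        with opener.open(url, timeout=5) as response:
            return response.status, response.read()
    finally:
        worker.join(timeout=5)
        server.server_close()
```

Calling fetch_once("/health") exercises the liveness route without Docker, and fetch_once("/") returns the JSON payload whose replica field follows APP_REPLICA_ID when that variable is set.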

The Dockerfile is a five-line python:3.12-slim wrapper. No build tools, no layered virtualenv — the stdlib is all we need.

The reverse proxy

nginx/nginx.conf declares one upstream and one server block:

http {
    upstream app_backend {
        server app:8000;
        keepalive 16;
    }

    server {
        listen 80 default_server;

        location /health {
            proxy_pass http://app_backend/health;
        }

        location / {
            proxy_pass http://app_backend;
            proxy_http_version 1.1;
            proxy_set_header Connection "";
        }
    }
}

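The upstream entry resolves app via Docker's embedded DNS, using the same name as the Compose service. That naming is what makes hot-swapping work later: the config never hardcodes a container IP, so pointing the proxy at a different replica is a one-word change to this file. Note that Nginx resolves the hostname in a plain server entry once, when it loads the configuration, so a changed replica is only picked up after a reload (nginx -s reload), not automatically. keepalive 16 keeps a small pool of idle connections to the upstream open for reuse (it relies on the proxy_http_version 1.1 and empty Connection header set in location /), which we will want when we start measuring rollout latency.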

Wiring it together

docker-compose.yml declares the two services on a shared bridge network:

services:
  app:
    build:
      context: ./app
    expose:
      - "8000"
    networks:
      - edge
  proxy:
    image: nginx:1.27-alpine
    depends_on:
      - app
    ports:
      - "8080:80"
    volumes:
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro
    networks:
      - edge

networks:
  edge:
    driver: bridge

Three structural decisions are encoded here:

  • The app does not publish ports. Only expose. The proxy is the single ingress, which is what makes the eventual rotation safe.
  • nginx.conf is mounted read-only, which only restricts writes from inside the container. Steps later in the series will rewrite this file on the host to point at a new upstream; mounting it from the host (rather than baking it into the image) is what lets us reload without rebuilding.
  • The network (edge) is declared explicitly. Docker's embedded DNS resolves the service name app from inside the proxy container on any user-defined network, and the default network that Compose creates for a project is one too, so it would also work. Being explicit means later steps can attach extra services without surprises.

Locking the shape in tests

tests/test_compose_config.py parses the YAML and asserts the invariants above: both services exist, the proxy depends on the app, the proxy's port 80 is published, nginx.conf is mounted read-only, and the two services share a declared network. tests/test_nginx_config.py checks that the upstream block exists and references the compose service name, and that the server block listens on port 80. tests/test_app_server.py boots the handler in-process and verifies /health, /, and the 404 path.

These tests deliberately avoid running Docker. They are static-shape contracts: if a future step deletes the edge network, drops the read-only flag, or hardcodes a replica IP into nginx, the suite fails before anything ships.
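As an illustration of what such a static-shape check can look like, here is a small sketch that reads an Nginx config as text and pulls out the upstream targets. The function names and the regex approach are assumptions; the repository's tests may be written differently.

```python
import re

SAMPLE_CONF = """
http {
    upstream app_backend {
        server app:8000;
        keepalive 16;
    }
}
"""


def upstream_servers(conf_text: str, upstream_name: str) -> list[str]:
    """Return the `server` targets declared inside one upstream block."""
    block = re.search(
        r"upstream\s+" + re.escape(upstream_name) + r"\s*\{(.*?)\}",
        conf_text,
        flags=re.DOTALL,
    )
    if block is None:
        return []
    # Each entry looks like `server app:8000;`; extra parameters are ignored.
    return re.findall(r"^\s*server\s+([^\s;]+)", block.group(1), flags=re.MULTILINE)


def hardcoded_ip_targets(targets: list[str]) -> list[str]:
    """Pick out targets that are an IPv4 address instead of a DNS name."""
    return [t for t in targets if re.match(r"\d{1,3}(\.\d{1,3}){3}(:\d+)?$", t)]
```

A test built on these helpers would assert that upstream_servers(SAMPLE_CONF, "app_backend") contains app:8000 and that hardcoded_ip_targets(...) comes back empty, which is the "no replica IP in nginx" contract in executable form.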

Test it

.venv/bin/python -m pytest tests/
.................                                                        [100%]
17 passed in 1.25s

All seventeen invariants pass — five for the app handler, seven for the compose layout, five for the nginx config.

What we got

A two-container baseline where ingress is fully decoupled from the workload, the workload exposes a cheap health probe, and the structural shape of the stack is pinned by a fast test suite. Nothing here rolls anything yet — but every assumption that a zero-downtime rollout will rely on (single proxy ingress, service-name DNS, read-only mounted nginx config, dedicated bridge network) is now an enforced contract instead of a comment in a README.

Repository

The companion code for this article: https://github.com/vytharion/docker-rollout-zero-downtime-compose

The state of the code after this step: 3c9de0a

Key commits to step through:

  • 3c9de0a — step 1: bootstrap a baseline docker-compose stack with app and reverse proxy

Step 2: Add healthcheck and readiness probe to the app service

What we're doing this step

The whole point of a rolling update is to remove a replica from the load balancer the instant it stops being able to serve traffic, and to add the new replica only once it actually can. That decision is not something the container runtime can guess. It needs an HTTP endpoint to ask. So in this step we split the single /health probe into two distinct signals — a liveness probe at /health that answers "is this process alive?" and a readiness probe at /ready that answers "would I want a real user landing on this replica right now?" — and we wire Docker's container-level healthcheck to the readiness one. We also flip depends_on to its long form so the proxy refuses to start until the app reports service_healthy, which closes the most embarrassing race in step 1: the proxy booting faster than the app and immediately 502-ing.

The reason this matters before any actual rollout work is that the rest of the series leans on condition: service_healthy as the single source of truth for "this replica is in rotation." When a later step swaps an old replica out and a new one in, the rollout controller is going to wait for that same signal. If /ready lies (say, by returning 200 the instant the process binds the port), every cutover will drop the first few requests on the floor. We pay that complexity now, once, in the cheapest possible way: a small ReadinessState object inside the app and a few lines of YAML.

Setup

Two files change in codebase/ and one is added:

codebase/
├── app/
│   └── server.py            # add /ready + ReadinessState
├── docker-compose.yml       # add healthcheck + depends_on long form
└── tests/
    └── test_app_readiness.py   # new — pins readiness semantics

No new runtime dependencies. The readiness probe is plain stdlib http.server, and the Docker healthcheck shells out to the python interpreter that already lives inside the app image — no curl install, no extra base image.

Implementation

A readiness state separate from liveness

The minimum-viable readiness object has two inputs (a warm-up window and an explicit drain switch) and one output (is_ready()). We keep it small enough to fit on a screen:

import time
from typing import Callable


class ReadinessState:
    def __init__(
        self,
        warmup_seconds: float = 0.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._warmup = max(0.0, warmup_seconds)  # negative input clamps to zero
        self._clock = clock
        self._origin = clock()  # warm-up is measured from construction
        self._forced_not_ready = False

    def mark_not_ready(self) -> None:
        # Drain: stay not-ready until mark_ready() is called.
        self._forced_not_ready = True

    def mark_ready(self) -> None:
        self._forced_not_ready = False
        # Rewind the origin so the warm-up window counts as already elapsed.
        self._origin = self._clock() - self._warmup

    def is_ready(self) -> bool:
        if self._forced_not_ready:
            return False
        return (self._clock() - self._origin) >= self._warmup

Three design choices to call out:

  1. clock is injected. The default is time.monotonic, but the unit tests pass a fake clock so the warm-up window can be exercised in microseconds instead of seconds. Without that seam, any test that exercises the warmup behavior would have to sleep, and the suite would be slow and flaky.
  2. A drain flag overrides the warm-up timer. Once mark_not_ready() is called, no amount of elapsed time will flip the probe back to green; only an explicit mark_ready() will. This is what a later step will use when it sends SIGTERM to an old replica and wants the proxy to stop sending it new connections immediately, even though the container is still alive and finishing its in-flight requests.
  3. mark_ready() rewinds _origin by exactly the warm-up window, so the elapsed time already equals the threshold and is_ready() returns True on the next call. That sounds backwards but it is what lets you flip a previously-drained replica back into rotation without making the proxy wait another warm-up cycle for a replica that has already proven itself.
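To make the three choices concrete, here is a sketch that drives the state object with a hand-cranked clock. ReadinessState is repeated from the listing above so the block runs on its own; FakeClock and walk_through() are hypothetical names introduced for this example.

```python
import time
from typing import Callable


class ReadinessState:
    def __init__(
        self,
        warmup_seconds: float = 0.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._warmup = max(0.0, warmup_seconds)
        self._clock = clock
        self._origin = clock()
        self._forced_not_ready = False

    def mark_not_ready(self) -> None:
        self._forced_not_ready = True

    def mark_ready(self) -> None:
        self._forced_not_ready = False
        self._origin = self._clock() - self._warmup

    def is_ready(self) -> bool:
        if self._forced_not_ready:
            return False
        return (self._clock() - self._origin) >= self._warmup


class FakeClock:
    """Hypothetical test double: time only moves when the test says so."""

    def __init__(self, start: float = 0.0) -> None:
        self.now = start

    def __call__(self) -> float:
        return self.now

    def advance(self, seconds: float) -> None:
        self.now += seconds


def walk_through() -> list[bool]:
    clock = FakeClock()
    state = ReadinessState(warmup_seconds=5.0, clock=clock)
    seen = [state.is_ready()]      # still inside the warm-up window
    clock.advance(5.0)
    seen.append(state.is_ready())  # warm-up has elapsed
    state.mark_not_ready()
    clock.advance(60.0)
    seen.append(state.is_ready())  # drained: elapsed time is irrelevant
    state.mark_ready()
    seen.append(state.is_ready())  # back in rotation with no new warm-up
    return seen
```

walk_through() returns [False, True, False, True] without a single sleep, which is the point of injecting the clock.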

Two probes, two contracts

Inside the request handler we now expose three distinct GETs:

def do_GET(self) -> None:  # noqa: N802 (stdlib API)
    if self.path == "/health":
        self._write(200, b"ok", "text/plain; charset=utf-8")
        return
    if self.path == "/ready":
        self._handle_ready()
        return
    if self.path == "/":
        self._handle_root()
        return
    self._write(404, b"not found", "text/plain; charset=utf-8")

The /health route is unchanged from step 1 — it returns ok as long as the process is alive and accepting connections. We deliberately keep it stupid. It is the probe a future Kubernetes-style livenessProbe would use to decide "this process is wedged, kill it." Liveness should never reflect warm-up or drain, because if it did, a slow warm-up would cause a kill-restart-loop instead of just keeping traffic away for a few extra seconds.

The /ready route is the new one and it reflects the ReadinessState directly:

def _handle_ready(self) -> None:
    if self.readiness_state.is_ready():
        self._write(200, b"ready", "text/plain; charset=utf-8")
        return
    self._write(503, b"not ready", "text/plain; charset=utf-8")

503, not 500, is the right status here. 503 Service Unavailable means "I am temporarily unable to serve this request", which is exactly what a warming or draining replica wants to say, and a poller can simply keep asking until it clears. 500 would suggest a code bug that needs a human, which is not what /ready reports. Docker's healthcheck never looks at the status code itself, only at the probe command's exit code, so the distinction matters mostly to HTTP clients that hit /ready directly.

The warm-up window itself is configurable from the environment so the Compose file can tune it per stack without rebuilding the image:

def _warmup_from_env() -> float:
    raw = os.environ.get("APP_READY_AFTER_SECONDS", "0")
    try:
        return max(0.0, float(raw))
    except ValueError:
        return 0.0


readiness = ReadinessState(warmup_seconds=_warmup_from_env())

A defensive try/except around float() is the entire error story: an unparseable value collapses to zero warm-up, which is the safest default for a healthcheck that the operator is about to gate cutover on. Crashing on a bad value would mean the container never reaches service_healthy at all, which is worse.
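The parsing rule is easy to poke at in isolation. This sketch applies the same logic to an arbitrary mapping instead of os.environ; warmup_from is a hypothetical name used only here.

```python
from typing import Mapping


def warmup_from(env: Mapping[str, str]) -> float:
    """Same parsing rule as _warmup_from_env, but reading from any mapping."""
    raw = env.get("APP_READY_AFTER_SECONDS", "0")
    try:
        # Negative numbers clamp to zero.
        return max(0.0, float(raw))
    except ValueError:
        # Anything float() cannot parse also collapses to zero.
        return 0.0
```

One thing the sketch makes visible: float() also accepts strings such as "inf", which parse without error and would produce a warm-up window that never ends. Whether that deserves its own guard is a judgment call the original code does not make.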

The Compose-level healthcheck

docker-compose.yml grows two blocks. The first is the actual healthcheck on the app service:

app:
  build:
    context: ./app
  environment:
    APP_PORT: "8000"
    APP_VERSION: "v1"
    APP_READY_AFTER_SECONDS: "0"
  expose:
    - "8000"
  networks:
    - edge
  healthcheck:
    test:
      - CMD-SHELL
      - python -c "import urllib.request,sys; r=urllib.request.urlopen('http://localhost:8000/ready', timeout=2); sys.exit(0 if r.status == 200 else 1)"
    interval: 10s
    timeout: 3s
    retries: 3
    start_period: 5s
  restart: unless-stopped

Every value in that healthcheck block was chosen on purpose:

  • test uses python -c rather than curl. The app image is python:3.12-slim; curl is not installed, and there is no reason to pull in 10 MB of system packages just to make one HTTP request when the interpreter that is already there can do it in a one-liner. urllib.request.urlopen raises HTTPError on 4xx and 5xx responses, so a 503 turns into a non-zero exit code automatically; the explicit sys.exit(...) handles responses that do not raise, exiting 0 only for an exact 200.
  • interval: 10s is a compromise. Probing more often (1s) detects failures faster, but every probe spawns a fresh Python interpreter, so a 1s interval means ten times as many process launches on an otherwise idle stack. 10s is fast enough that a rollout doesn't visibly stall waiting for the next probe tick but slow enough that the probe traffic is invisible.
  • timeout: 3s is well above the time a /ready probe actually takes but well below interval. The gap matters: Docker schedules the next probe interval seconds after the previous one completes, so probes never overlap, but a hung probe holds up the schedule for its full timeout, and a long timeout directly delays detection of a real outage.
  • retries: 3 means Docker will not flip a healthy container to unhealthy on a single 503. Three consecutive failures are required, which at a 10s interval takes roughly 20 to 30 seconds. This soaks up transient hiccups (a long GC pause, a momentary network blip) without marking a replica that is fine as unhealthy.
  • start_period: 5s is the grace window after the container starts. During this period, failing probes do not count toward retries and do not mark the container unhealthy, while a single successful probe marks it healthy right away. That is what makes warm-up safe: a replica that needs 4s to come up can report 503 during start without being penalised. One caveat: unless start_interval is also set, the first probe only runs one interval (10s) after start, so with these values the 5s window has already closed by the time Docker first asks.
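The exit-code behaviour of the probe one-liner can be checked outside Docker. The sketch below is a function-shaped equivalent under two assumptions of mine: it uses an opener that ignores proxy settings (the Compose one-liner calls urlopen directly), and probe_stub() is a hypothetical throwaway server that always answers with one fixed status.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Assumption: ignore any proxy from the environment; the probe targets loopback.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def probe_exit_code(url: str, timeout: float = 2.0) -> int:
    """Return 0 when the endpoint answers exactly 200, and 1 otherwise."""
    try:
        with _OPENER.open(url, timeout=timeout) as response:
            return 0 if response.status == 200 else 1
    except OSError:
        # HTTPError (such as a 503), URLError (connection refused) and
        # timeouts are all OSError subclasses.
        return 1


def probe_stub(status: int) -> int:
    """Probe a one-shot local server that always replies with `status`."""

    class Stub(BaseHTTPRequestHandler):
        def do_GET(self) -> None:  # noqa: N802 (stdlib API)
            self.send_response(status)
            self.send_header("Content-Length", "0")
            self.end_headers()

        def log_message(self, format: str, *args: object) -> None:
            pass

    server = HTTPServer(("127.0.0.1", 0), Stub)
    worker = threading.Thread(target=server.handle_request, daemon=True)
    worker.start()
    try:
        return probe_exit_code(f"http://127.0.0.1:{server.server_port}/ready")
    finally:
        worker.join(timeout=5)
        server.server_close()
```

probe_stub(200) yields 0 and probe_stub(503) yields 1, matching the healthy and not-ready cases. probe_stub(204) also yields 1, which is the case the explicit status comparison exists for: a 2xx response that does not raise but is still not the 200 the contract asks for.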

Gating the proxy on the app's health

The second compose change converts depends_on from the short form to the long form:

proxy:
  image: nginx:1.27-alpine
  depends_on:
    app:
      condition: service_healthy
  ports:
    - "8080:80"
  volumes:
    - ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro
  networks:
    - edge
  restart: unless-stopped

condition: service_healthy means docker compose up will not start the proxy container at all until the app's healthcheck has reported healthy at least once. In step 1 we had no healthcheck and the proxy would happily boot in parallel with the app, then 502 for the first second of the test. With the gate in place that race is gone — the proxy comes up into a stack where its upstream is already serving 200s.

restart: unless-stopped shows up on both services. It is not strictly part of "healthcheck" but it shares the same goal: once we promise zero-downtime, a single container crash should self-heal without paging an operator.

Tests that pin the new contracts

tests/test_app_readiness.py adds six new tests, all of them static or in-process — none of them spin up Docker:

  • test_ready_endpoint_returns_200_when_ready — happy path.
  • test_ready_endpoint_returns_503_when_drained checks that mark_not_ready() flips /ready immediately.
  • test_health_endpoint_stays_200_even_when_not_ready — proves liveness and readiness are separate signals.
  • test_readiness_state_warmup_blocks_then_passes — uses the injected fake clock to advance time through the warm-up window without sleeping.
  • test_readiness_state_drain_overrides_warmup — drain wins over warm-up.
  • test_readiness_state_rejects_negative_warmup — bad input collapses to zero.

On the compose side, tests/test_compose_config.py gains three tests pinning the YAML invariants we just introduced:

  • test_app_service_declares_a_healthcheck — asserts the test command targets /ready, not /health. This catches the easy mistake of pointing the healthcheck at liveness.
  • test_app_healthcheck_tunes_timing_for_rollouts requires interval, timeout, retries, and start_period to all be set. Defaults are not good enough for a rollout-sensitive stack.
  • test_proxy_waits_for_app_to_be_healthy requires depends_on to be the long form and to specify condition: service_healthy. The short form passes lint but does not actually gate on health, and that distinction is invisible in code review without a test.
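The short-form versus long-form distinction is simple to express once the YAML has been parsed into plain dictionaries. A minimal sketch, assuming the service definition arrives as a dict (for example from a YAML loader); gates_on_health is a hypothetical helper name, not the repository's test.

```python
from typing import Any


def gates_on_health(service: dict[str, Any], dependency: str) -> bool:
    """True only for the long form with `condition: service_healthy`."""
    depends_on = service.get("depends_on")
    if not isinstance(depends_on, dict):
        # The short form is a plain list of names and carries no condition.
        return False
    entry = depends_on.get(dependency)
    return isinstance(entry, dict) and entry.get("condition") == "service_healthy"


SHORT_FORM = {"depends_on": ["app"]}
LONG_FORM = {"depends_on": {"app": {"condition": "service_healthy"}}}
STARTED_ONLY = {"depends_on": {"app": {"condition": "service_started"}}}
```

Only LONG_FORM passes the check. The other two shapes are valid Compose syntax, which is why a test is needed to tell them apart.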

Test it

.venv/bin/python -m pytest tests/
..........................                                               [100%]
26 passed in 2.37s

Nine new tests on top of the seventeen from step 1 — six pin the readiness behavior of app/server.py, three pin the new compose-level invariants.

What we got

The app now distinguishes "the process is alive" from "I am willing to serve a real user", and Docker is wired to ask the second question on a 10-second cadence with a 5-second grace window and three-strikes tolerance. The proxy refuses to start until the app has answered that second question with a 200 at least once, so the obvious step-1 race ("proxy is faster than the app, returns 502 for the first second") is gone. The ReadinessState object also exposes a mark_not_ready() drain switch: useless on its own, but the hook that a later step will pull on when it wants an old replica taken out of rotation while it finishes serving in-flight requests. None of this is yet a rollout, but every signal the rollout will lean on is now present, tested, and observable from outside the container.

Repository

The companion code for this article: https://github.com/vytharion/docker-rollout-zero-downtime-compose

The state of the code after this step: f69b280

Key commits to step through:

  • 3c9de0a — step 1: bootstrap a baseline docker-compose stack with app and reverse proxy
  • f69b280 — step 2: add healthcheck and readiness probe to the app service

Step 3: Write a blue-green rollout script that spins up the green container alongside blue

What we're doing this step

The whole shape of a blue-green rollout is "two replicas, one of them taking traffic, the other one ready to take it." Up to this point the stack has had only one app container, so there is nowhere for traffic to go to during a cutover. This step fixes the simpler half of that problem: it adds a second app service — app_green — behind a Compose profile, then writes a tiny Python rollout planner that, given the currently active color, figures out which color to bring up next and constructs the exact docker compose up -d --wait command for it. The deliberate boundary of this step is that it stops before moving any user traffic. The planner spins green up alongside blue, watches Docker confirm the new replica is healthy, and then exits. Nothing about Nginx changes yet. Real users still hit blue. We are just proving that a second replica can exist, be reached, and report ready, all while the first replica continues to serve.

The reason for slicing the work this way is that the cutover step — swapping the upstream at the proxy — is the most failure-prone part of the whole series. If we tried to bring up green and switch traffic in a single commit, a regression in either half would be hard to bisect and easy to blame on the wrong piece. Splitting it lets step 3 own one crisp invariant: bringing up the new color must never touch the running one. The planner's compose command is forbidden from containing down, stop, rm, kill, or restart, and there is a test that fails the whole suite if anyone slips one in. Step 4 will then own a different, equally crisp invariant: the moment of switching upstreams must be atomic. Both invariants are easier to reason about — and to test — in isolation.

Setup

Three things change in codebase/. One Compose service is added, one Python module is created, and a new test file pins the planner's behaviour:

codebase/
├── docker-compose.yml         # add app_green service + color labels
├── scripts/
│   ├── __init__.py            # new — makes scripts/ an importable package
│   └── rollout.py             # new — the planner + CLI entrypoint
└── tests/
    └── test_rollout.py        # new — pins planner invariants in-process

No new runtime dependencies. The planner is pure stdlib (dataclasses, pathlib, shlex, subprocess), and the tests do not require Docker — every interesting branch is reachable by passing fake runners and temporary state files into the planning functions.

Implementation

A second service behind a profile

docker-compose.yml grows an app_green service that is, intentionally, a near-clone of app. The two replicas must build from the same context and join the same network; anything else that diverges between them would let bugs hide on one color and not the other. The differences that remain are deliberate: the APP_REPLICA_ID environment variable and the color label (so an HTTP response, or a container listing, identifies which color it is), APP_VERSION (v1 on blue, v2 on green), and the profiles: [green] gate (so a plain docker compose up never starts both colors at the same time).

app:
  build:
    context: ./app
  environment:
    APP_PORT: "8000"
    APP_VERSION: "v1"
    APP_REPLICA_ID: "blue"
    APP_READY_AFTER_SECONDS: "0"
  expose:
    - "8000"
  networks:
    - edge
  labels:
    com.vytharion.rollout.color: "blue"
  healthcheck: *app-healthcheck
  restart: unless-stopped

app_green:
  build:
    context: ./app
  environment:
    APP_PORT: "8000"
    APP_VERSION: "v2"
    APP_REPLICA_ID: "green"
    APP_READY_AFTER_SECONDS: "0"
  expose:
    - "8000"
  networks:
    - edge
  labels:
    com.vytharion.rollout.color: "green"
  profiles:
    - green
  healthcheck: *app-healthcheck
  restart: unless-stopped

Two structural choices to call out:

  1. The healthcheck is a YAML alias (*app-healthcheck) to a single anchored definition, reused on both services. This is not just aesthetics. If the two colors had separately authored healthchecks, it would be easy for the green probe to drift (different timeout, different endpoint, different start_period), and then a green replica could be flapping in a way the blue replica never would. Sharing the literal definition collapses that whole class of bug into a YAML alias.
  2. Color is exposed two ways: as an environment variable inside the container and as a Docker label on the container. The env var is what makes the JSON response on / say "I am green." The label is what lets the rollout script — and, later, an operator running docker ps --filter — pick out exactly the container that belongs to a given color without parsing service names.
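As a sketch of how the label could be used from a script, the function below builds (but does not run) a docker ps invocation filtered by color. The function name is hypothetical; --filter with a label=key=value expression and --format with a Go template are standard docker ps options.

```python
COLOR_LABEL = "com.vytharion.rollout.color"
KNOWN_COLORS = ("blue", "green")


def containers_for_color_command(color: str) -> tuple[str, ...]:
    """Build the argv that lists container names carrying one color label."""
    if color not in KNOWN_COLORS:
        raise ValueError(f"unknown color: {color!r}")
    return (
        "docker",
        "ps",
        "--filter",
        f"label={COLOR_LABEL}={color}",
        "--format",
        "{{.Names}}",
    )
```

Returning an argv tuple rather than running the command keeps the helper testable without a Docker daemon, in the same spirit as the planner described below.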

The profile gate is what makes the rest of the design safe. By default docker compose up only starts services with no profile (so just app and proxy), which means our existing tests and the existing proxy → app depends_on chain keep working unchanged. Green only ever comes up when something explicitly opts into the green profile, which the rollout script does and nothing else does.

A pure planner with one I/O seam

scripts/rollout.py opens with constants and the cheapest possible state store — a single file containing the literal word blue or green:

BLUE = "blue"
GREEN = "green"
COLORS: tuple[str, ...] = (BLUE, GREEN)

DEFAULT_STATE_PATH = Path(
    os.environ.get("ROLLOUT_STATE_PATH", "state/active-color")
)
DEFAULT_COMPOSE_FILE = Path(
    os.environ.get("ROLLOUT_COMPOSE_FILE", "docker-compose.yml")
)
DEFAULT_PROJECT_NAME = os.environ.get("ROLLOUT_PROJECT_NAME", "rollout")

A few small helpers carry the entire colour-swapping vocabulary. Two of them are shown here:

def next_color(current: str) -> str:
    if current == BLUE:
        return GREEN
    if current == GREEN:
        return BLUE
    raise ValueError(f"unknown color: {current!r}")


def service_for_color(color: str) -> str:
    if color == BLUE:
        return "app"
    if color == GREEN:
        return "app_green"
    raise ValueError(f"unknown color: {color!r}")

The asymmetry between "color" (blue / green) and "service" (app / app_green) is deliberate. The legacy single-replica service is named app, and renaming it would break every other Compose feature that already points at it — depends_on, the proxy's upstream, the test suite. So the planner accepts the asymmetry rather than papering over it, and service_for_color is the single place where the mapping lives.

read_active_color defends against three failure modes — a missing state file (fresh machine), garbage content (operator typo), and stray whitespace (someone echo'd a value with a trailing newline):

def read_active_color(state_path: Path = DEFAULT_STATE_PATH) -> str:
    if not state_path.exists():
        return BLUE
    raw = state_path.read_text(encoding="utf-8").strip()
    if raw in COLORS:
        return raw
    return BLUE

All three failure modes collapse to the same safe default: blue. Blue is the only color that exists without a profile, so falling back to blue means the planner will always propose to bring up green next, which is exactly what an operator running rollouts for the first time would expect.
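The three failure modes can be exercised against a temporary directory. In this sketch read_active_color is repeated from above (minus the default argument) so the block runs on its own; demo() is a hypothetical helper.

```python
import tempfile
from pathlib import Path

BLUE = "blue"
GREEN = "green"
COLORS: tuple[str, ...] = (BLUE, GREEN)


def read_active_color(state_path: Path) -> str:
    if not state_path.exists():
        return BLUE
    raw = state_path.read_text(encoding="utf-8").strip()
    if raw in COLORS:
        return raw
    return BLUE


def demo() -> dict[str, str]:
    """Run the reader against a missing, a valid, and a garbage state file."""
    results: dict[str, str] = {}
    with tempfile.TemporaryDirectory() as tmp:
        state = Path(tmp) / "active-color"
        results["missing file"] = read_active_color(state)
        state.write_text("green\n", encoding="utf-8")
        results["trailing newline"] = read_active_color(state)
        state.write_text("gren", encoding="utf-8")
        results["typo"] = read_active_color(state)
    return results
```

demo() maps the missing file and the typo to blue, and the newline-terminated value to green.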

Building the compose command for the new color

The core planning primitive is compose_up_command, and the whole step-3 invariant lives inside it:

def compose_up_command(
    color: str,
    project_name: str = DEFAULT_PROJECT_NAME,
    compose_file: Path = DEFAULT_COMPOSE_FILE,
) -> tuple[str, ...]:
    service = service_for_color(color)
    parts: list[str] = [
        "docker",
        "compose",
        "--project-name",
        project_name,
        "-f",
        str(compose_file),
    ]
    if color == GREEN:
        parts.extend(["--profile", GREEN])
    parts.extend(["up", "-d", "--wait", service])
    return tuple(parts)

Two important features of this command:

  • It names a specific service at the end. docker compose up -d --wait <service> starts only the named service plus anything it lists under depends_on; app_green declares no dependencies, so no other service in the file is recreated, restarted, or stopped. That is the single line of defense behind the "bring green up alongside blue" invariant. If we left the service name off, Compose would treat it as "bring up every service in this file," and would recreate the currently-serving app container as a side effect whenever its configuration or image had changed.
  • --wait is non-negotiable. It tells Compose to block the CLI until every named container reports healthy (the same healthcheck from step 2). Without it, the command would return success the moment the container started, and the rollout script would happily declare green ready while it was still warming up. With it, the planner inherits step 2's readiness probe as its definition of "deployed."

The --profile green flag is only appended for green, because blue lives outside any profile. Compose treats profile flags as additive — passing --profile green while bringing up app would still work, but it would also become a documentation lie ("we only enable the green profile when deploying green"). Keeping the flag conditional makes the invocation log read honestly.
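Here is a compact, runnable sketch of the command builder together with the "no disruptive verbs" check it is meant to satisfy. The builder mirrors the listing above with literal defaults; FORBIDDEN_VERBS and disruptive_verbs() are hypothetical names for this example.

```python
import shlex
from pathlib import Path

BLUE = "blue"
GREEN = "green"
FORBIDDEN_VERBS = frozenset({"down", "stop", "rm", "kill", "restart"})


def service_for_color(color: str) -> str:
    if color == BLUE:
        return "app"
    if color == GREEN:
        return "app_green"
    raise ValueError(f"unknown color: {color!r}")


def compose_up_command(
    color: str,
    project_name: str = "rollout",
    compose_file: Path = Path("docker-compose.yml"),
) -> tuple[str, ...]:
    service = service_for_color(color)
    parts = ["docker", "compose", "--project-name", project_name, "-f", str(compose_file)]
    if color == GREEN:
        # Only green lives behind a profile, so only green needs the flag.
        parts.extend(["--profile", GREEN])
    parts.extend(["up", "-d", "--wait", service])
    return tuple(parts)


def disruptive_verbs(command: tuple[str, ...]) -> set[str]:
    """Return any argv tokens that would stop or replace a running container."""
    return set(FORBIDDEN_VERBS.intersection(command))


def render(command: tuple[str, ...]) -> str:
    """Shell-quoted form suitable for logging or copy-paste."""
    return shlex.join(command)
```

Because the check compares whole argv tokens, a service named app_restart would not trip it, while a project literally named restart would. That is a limitation of this sketch worth keeping in mind.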

A frozen plan that the runner consumes

Everything above gets wrapped in a RolloutPlan dataclass — frozen, so once the planner has committed to a decision, no caller can mutate it midway through execution:

@dataclass(frozen=True)
class RolloutPlan:
    current_color: str
    next_color: str
    next_service: str
    compose_command: tuple[str, ...]

    def describe(self) -> str:
        return (
            f"current={self.current_color} next={self.next_color} "
            f"service={self.next_service} "
            f"cmd={shlex.join(self.compose_command)}"
        )


def plan_next_rollout(
    state_path: Path = DEFAULT_STATE_PATH,
    project_name: str = DEFAULT_PROJECT_NAME,
    compose_file: Path = DEFAULT_COMPOSE_FILE,
) -> RolloutPlan:
    current = read_active_color(state_path)
    nxt = next_color(current)
    cmd = compose_up_command(
        nxt,
        project_name=project_name,
        compose_file=compose_file,
    )
    return RolloutPlan(
        current_color=current,
        next_color=nxt,
        next_service=service_for_color(nxt),
        compose_command=cmd,
    )

The describe() line is small but matters: every real rollout will print exactly one line to stdout summarising what it is about to do, with the compose command rendered through shlex.join so an operator can copy-paste it into a shell to re-run the same step manually if something goes sideways.

Execution is a separate function that takes a Runner callable. The default runs subprocess.run; tests pass a fake that records the command without executing it:

Runner = Callable[[Sequence[str]], int]


def default_runner(cmd: Sequence[str]) -> int:
    completed = subprocess.run(list(cmd), check=False)
    return completed.returncode


def execute_plan(plan: RolloutPlan, runner: Runner = default_runner) -> int:
    return runner(plan.compose_command)


def main(argv: Iterable[str] | None = None) -> int:
    plan = plan_next_rollout()
    print(plan.describe())
    return execute_plan(plan)

That Runner seam is the entire reason the test suite can stay fast. Every test in tests/test_rollout.py either exercises the planner directly (no subprocess) or hands execute_plan a fake runner that records the argv it was passed and returns a stubbed exit code. There are zero tests that spin up Docker, which means the suite still runs in seconds and the CI environment does not need a Docker daemon.
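A fake runner can be as small as a callable that remembers what it was handed. The sketch below uses a trimmed RolloutPlan with only the field execute_plan reads; RecordingRunner is a hypothetical test double, not the repository's fixture.

```python
from dataclasses import dataclass, field
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], int]


@dataclass(frozen=True)
class RolloutPlan:
    # Trimmed for the example: the real plan carries more fields.
    compose_command: tuple[str, ...]


@dataclass
class RecordingRunner:
    """Records every argv it receives and returns a preset exit code."""

    exit_code: int = 0
    calls: list[tuple[str, ...]] = field(default_factory=list)

    def __call__(self, cmd: Sequence[str]) -> int:
        self.calls.append(tuple(cmd))
        return self.exit_code


def execute_plan(plan: RolloutPlan, runner: Runner) -> int:
    # The runner is the only place a subprocess would be started.
    return runner(plan.compose_command)
```

With runner = RecordingRunner(exit_code=3), execute_plan(plan, runner) returns 3 and runner.calls holds exactly one entry equal to plan.compose_command, so both the argv and the exit-code propagation can be asserted without starting a process.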

Tests that pin the step-3 invariants

tests/test_rollout.py adds thirty new tests. The interesting ones are the ones that pin behaviour the rest of the series will rely on:

  • test_compose_up_command_does_not_stop_or_recreate_other_color — iterates over every color and asserts the resulting command contains none of down, stop, rm, kill, restart. This is the literal encoding of "green comes up alongside blue."
  • test_compose_up_command_runs_detached_and_waits_for_health locks in both -d and --wait. --wait already implies detached mode, so -d is there to keep the intent explicit in the logged command. Drop --wait and the command returns "success" the moment the container starts but before it is healthy.
  • test_compose_up_command_for_green_includes_green_profile / test_compose_up_command_for_blue_does_not_use_profile_flag — encode the asymmetric profile rule directly.
  • test_plan_next_rollout_promotes_green_from_fresh_state and its blue-active counterpart — verify the round-trip: state says blue → plan deploys green, state says green → plan deploys blue.
  • test_plan_does_not_disturb_currently_active_color — the integration-flavoured statement of the same invariant, asserted at the RolloutPlan level rather than on the bare command.
  • test_execute_plan_propagates_runner_exit_code — a non-zero exit from docker compose up --wait must surface to the caller. Swallowing it would let a half-deployed green look like a success.

On the Compose side, tests/test_compose_config.py gains seven new tests that pin the green service shape: test_compose_declares_a_green_app_service, test_green_service_is_gated_by_green_profile, test_blue_service_is_not_profile_gated, test_green_service_shares_blue_build_context, test_green_service_joins_the_shared_edge_network, test_green_service_declares_its_own_healthcheck, and test_color_labels_are_set_on_both_services. Together they make it impossible for green to drift away from blue in any way that would matter to the rollout: same build context, same network, same readiness probe, just a different identity.

Test it

.venv/bin/python -m pytest tests/
...............................................................          [100%]
63 passed in 2.42s

Thirty-seven new tests on top of the twenty-six from steps 1 and 2: thirty in test_rollout.py plus seven new compose-shape tests in test_compose_config.py. The whole suite still runs in well under three seconds because nothing in the new module talks to Docker.

What we got

A blue-green rollout that does the half of the job which is safe to do in isolation: bring up a fresh app_green container, gated by a Compose profile so a plain docker compose up never touches it, and wait for its healthcheck to pass before declaring success. The planner that drives this is a pure-data dataclass with one I/O seam, which keeps every interesting branch reachable from in-process tests — no Docker daemon, no fixtures, no flake. The new color shares the same image, same network, and same readiness probe as blue, so when step 4 finally swaps the proxy's upstream entries from app to app_green, the only difference the user will notice is the APP_REPLICA_ID field flipping from blue to green in the JSON response. Crucially, nothing in this step touches the running blue replica — the compose command is literally forbidden, by test, from containing any verb that would. That separation is what is going to make step 4's atomic cutover easy to reason about, because by the time we get there the new color will already be warm, healthy, and reachable on the shared edge network, just not yet wired into the proxy.

Repository

The companion code for this article: https://github.com/vytharion/docker-rollout-zero-downtime-compose

The state of the code after this step: 0084713

Key commits to step through:

  • 3c9de0a — step 1: bootstrap a baseline docker-compose stack with app and reverse proxy
  • f69b280 — step 2: add healthcheck and readiness probe to the app service
  • 0084713 — step 3: write a blue-green rollout script that spins up the green container alongside blue

What we're doing this step

Step 3 left the stack in an unusual halfway state: a fresh green container running healthily on the shared edge network, while every actual request still landed on blue because the reverse proxy had no idea green existed. This step closes that gap by introducing the single, atomic moment of cutover — the instant the proxy stops sending bytes to app and starts sending them to app_green. The mechanism is deliberately small. The upstream app_backend { ... } block in nginx.conf no longer hardcodes server app:8000;; instead it includes a one-line snippet at /etc/nginx/conf.d/upstream.conf. The rollout script writes a new version of that snippet (pointing at the new color), then sends nginx a SIGHUP via nginx -s reload. SIGHUP spawns a fresh worker pool against the new config while the old pool finishes its in-flight requests and exits. There is no restart, no port rebind, no dropped TCP connection from the client's perspective. Only after the reload reports success does the script commit the new color to the on-disk state file, so a botched reload leaves the world consistent with what nginx is actually serving.

The reason this swap is worth its own step — rather than being bolted onto step 3 — is that the cutover surface is where almost every "we lost requests during deploy" bug in a blue-green setup actually lives. Restart the proxy instead of reloading it and you drop the listen socket for a few hundred milliseconds. Write the new snippet after the reload and an out-of-band SIGHUP races you to an empty file. Persist the new color before the reload and a failed reload leaves your bookkeeping pointing at a replica nginx never accepted. Each of those is a one-line mistake with a thirty-minute outage hiding inside it. Isolating the swap in its own commit, with its own pinned invariants, gives us a place to write tests against every one of those ordering rules.

Setup

Two files in codebase/ are touched in nginx-land, one in Compose-land, one in Python-land — plus three test files grow new sections that pin step-4 behaviour:

codebase/
├── docker-compose.yml              # mount nginx/conf.d into the proxy
├── nginx/
│   ├── nginx.conf                  # replace hardcoded `server` with `include`
│   └── conf.d/
│       └── upstream.conf           # new — baseline single-line snippet
├── scripts/
│   └── rollout.py                  # add cutover planner + executor
└── tests/
    ├── test_compose_config.py      # pin the conf.d mount
    ├── test_nginx_config.py        # pin the `include` seam + baseline snippet
    └── test_rollout.py             # pin cutover ordering + rollback semantics

No new runtime or test dependencies. Cutover state is still the same one-line file from step 3 (state/active-color), nginx is still the same nginx:1.27-alpine image from step 1, and the new planning code stays inside scripts/rollout.py, using only dataclasses, pathlib, subprocess, and shlex. Every interesting branch is reachable from a tmp_path fixture and a fake Runner; nothing in the suite needs Docker to evaluate.

Implementation

The cutover seam in nginx

The whole reason this can be a single reload is that nginx already supports it: the include directive is valid inside an upstream block, and nginx -s reload makes the master process re-read the configuration, included files and all, before it starts new workers. So nginx.conf learns one change: drop the literal server line and point at a file that the rollout script will own:

http {
    upstream app_backend {
        include /etc/nginx/conf.d/upstream.conf;
        keepalive 16;
    }

    server {
        listen 80 default_server;

        location / {
            proxy_pass http://app_backend;
            proxy_http_version 1.1;
            proxy_set_header Host $host;
            proxy_set_header Connection "";
        }
    }
}

The repo ships a baseline nginx/conf.d/upstream.conf so a fresh docker compose up starts serving from blue without the rollout script ever running:

server app:8000;

Two structural rules to call out about that snippet. First, it is just a body — no upstream {} wrapper of its own — because it is spliced inside the existing upstream app_backend { ... } block. A nested upstream keyword would be an nginx syntax error, and the test test_upstream_snippet_has_no_outer_block exists precisely to keep someone from "tidying it up" with one. Second, it must point at the blue service (app) and never reference app_green on disk. Whether we are currently serving blue or green is the rollout script's runtime decision; the repo's checked-in default is always blue, so cloning the repo and starting Compose without any rollout work gives you a normal single-replica stack.
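Both rules are simple enough to check with plain string inspection. The helper below is an illustrative sketch, not the repository's test code; the name snippet_problems is hypothetical.

```python
def snippet_problems(text: str) -> list[str]:
    """Return rule violations for a checked-in upstream snippet (sketch)."""
    problems = []
    # Rule 1: the snippet is spliced inside an existing upstream block,
    # so it must not bring a wrapper of its own.
    if "upstream" in text or "{" in text or "}" in text:
        problems.append("snippet must be a bare body, not a wrapped block")
    # Rule 2: the checked-in default always points at blue.
    if "app_green" in text:
        problems.append("checked-in default must not reference app_green")
    if "server app:" not in text:
        problems.append("snippet must contain a server directive for app")
    return problems


print(snippet_problems("server app:8000;\n"))
print(snippet_problems("upstream app_backend { server app:8000; }\n"))
```

The baseline snippet yields an empty list, while the wrapped variant reports the first rule.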

For the proxy container to see updates to that file, docker-compose.yml mounts the whole conf.d directory read-only:

proxy:
  image: nginx:1.27-alpine
  volumes:
    - ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro
    - ./nginx/conf.d:/etc/nginx/conf.d:ro

Mounting the directory rather than the single file is important. A bind mount of a single file pins an inode; if anything ever replaced the snippet by writing a temporary file and renaming it over the original (which is what most editors and atomic-write helpers do), the inode would change and the container would still be looking at the old one. Path.write_text itself truncates and rewrites the existing file in place, so it happens to keep the inode, but the rollout should not depend on that detail. Mounting the directory means we are pointing at a path, not an inode, so any write the host performs is immediately visible inside the container.
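The inode behaviour can be observed on the host with a small experiment. This is a sketch that assumes a POSIX filesystem; the helper names are made up for illustration. It compares an in-place write with a write-then-rename.

```python
import os
import tempfile
from pathlib import Path


def inode_after_in_place_write(path: Path, body: str) -> int:
    # Opening an existing file in "w" mode truncates and rewrites it.
    path.write_text(body, encoding="utf-8")
    return path.stat().st_ino


def inode_after_atomic_replace(path: Path, body: str) -> int:
    # Write a sibling temp file, then rename it over the target.
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(body, encoding="utf-8")
    os.replace(tmp, path)
    return path.stat().st_ino


with tempfile.TemporaryDirectory() as d:
    conf = Path(d) / "upstream.conf"
    conf.write_text("server app:8000;\n", encoding="utf-8")
    original = conf.stat().st_ino
    same = inode_after_in_place_write(conf, "server app_green:8000;\n")
    replaced = inode_after_atomic_replace(conf, "server app:8000;\n")
    print("in-place keeps inode:", same == original)
    print("rename keeps inode:", replaced == original)
```

A single-file bind mount would keep following the original inode in the second case, which is the situation the directory mount avoids.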

A pure cutover plan

scripts/rollout.py grows three small helpers and one dataclass, every one of them defined so the tests can exercise them with no filesystem writes and no subprocess.

The first is the smallest. Given a color and a port, return exactly the body the snippet should contain:

def upstream_directive_for_color(
    color: str,
    port: int = DEFAULT_APP_PORT,
) -> str:
    service = service_for_color(color)
    return f"server {service}:{port};\n"

That is intentionally one line plus a trailing newline. Anything longer would be hard to diff in a test, and anything shorter (no newline) would make cat upstream.conf from the proxy look ugly without changing nginx's behaviour. The function reuses service_for_color from step 3, which already encodes the blue→app / green→app_green asymmetry, so there is no second place where that mapping has to stay in sync.
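To make the output concrete, here is a self-contained sketch. The body of service_for_color is an assumption based on the blue→app / green→app_green mapping described above, and DEFAULT_APP_PORT is assumed to be 8000 to match the baseline snippet.

```python
DEFAULT_APP_PORT = 8000  # assumed; matches the baseline "server app:8000;"


def service_for_color(color: str) -> str:
    """Assumed shape of the step 3 helper: blue -> app, green -> app_green."""
    services = {"blue": "app", "green": "app_green"}
    if color not in services:
        raise ValueError(f"unknown color: {color!r}")
    return services[color]


def upstream_directive_for_color(color: str, port: int = DEFAULT_APP_PORT) -> str:
    return f"server {service_for_color(color)}:{port};\n"


print(repr(upstream_directive_for_color("blue")))
print(repr(upstream_directive_for_color("green", port=9000)))
```

The two calls print 'server app:8000;\n' and 'server app_green:9000;\n' respectively.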

The second helper is the argv for asking nginx to reload, built out the same way compose_up_command was in step 3:

def nginx_reload_command(
    project_name: str = DEFAULT_PROJECT_NAME,
    compose_file: Path = DEFAULT_COMPOSE_FILE,
    proxy_service: str = DEFAULT_PROXY_SERVICE,
) -> tuple[str, ...]:
    return (
        "docker",
        "compose",
        "--project-name",
        project_name,
        "-f",
        str(compose_file),
        "exec",
        "-T",
        proxy_service,
        "nginx",
        "-s",
        "reload",
    )

Two non-obvious choices here. -T disables TTY allocation on docker compose exec, which is what makes the call work from a non-interactive CI shell — without it, Docker tries to attach a TTY and fails when stdin is not one. And -s reload is the entire atomicity story: SIGHUP, not SIGTERM, not docker restart proxy. A restart would tear down the listen socket; a reload preserves it. The test test_nginx_reload_command_uses_signal_reload_not_restart literally scans the argv for restart, stop, kill, down, and rm and fails the suite if any of them ever creep in.
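A sketch of what such an argv scan could look like follows. The default project name, compose file, and proxy service name are placeholders, since the real DEFAULT_* constants are not shown in this article.

```python
from pathlib import Path

DISRUPTIVE = {"restart", "stop", "kill", "down", "rm"}


def nginx_reload_command(
    project_name: str = "demo",                       # placeholder default
    compose_file: Path = Path("docker-compose.yml"),  # placeholder default
    proxy_service: str = "proxy",                     # placeholder default
) -> tuple[str, ...]:
    return (
        "docker", "compose",
        "--project-name", project_name,
        "-f", str(compose_file),
        "exec", "-T", proxy_service,
        "nginx", "-s", "reload",
    )


def check_reload_is_signal_only(argv: tuple[str, ...]) -> None:
    # No verb that would tear down the proxy's listen socket.
    assert DISRUPTIVE.isdisjoint(argv), argv
    # -T must directly follow exec so no TTY is requested.
    assert argv[argv.index("exec") + 1] == "-T", argv
    # The command ends with the nginx reload signal.
    assert argv[-3:] == ("nginx", "-s", "reload"), argv


check_reload_is_signal_only(nginx_reload_command())
```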

The third piece wraps both helpers in a frozen plan:

@dataclass(frozen=True)
class CutoverPlan:
    from_color: str
    to_color: str
    upstream_path: Path
    upstream_body: str
    reload_command: tuple[str, ...]
    # Stored so a later inverse planner can rebuild the upstream body
    # without the caller having to remember which port was used.
    port: int = DEFAULT_APP_PORT

    def describe(self) -> str:
        body = self.upstream_body.strip().replace("\n", " | ")
        return (
            f"cutover from={self.from_color} to={self.to_color} "
            f"upstream={self.upstream_path} body={body!r} "
            f"reload={shlex.join(self.reload_command)}"
        )


def plan_cutover(
    state_path: Path = DEFAULT_STATE_PATH,
    upstream_path: Path = DEFAULT_UPSTREAM_CONF_PATH,
    project_name: str = DEFAULT_PROJECT_NAME,
    compose_file: Path = DEFAULT_COMPOSE_FILE,
    proxy_service: str = DEFAULT_PROXY_SERVICE,
    port: int = DEFAULT_APP_PORT,
) -> CutoverPlan:
    current = read_active_color(state_path)
    target = next_color(current)
    return CutoverPlan(
        from_color=current,
        to_color=target,
        upstream_path=upstream_path,
        upstream_body=upstream_directive_for_color(target, port=port),
        reload_command=nginx_reload_command(
            project_name=project_name,
            compose_file=compose_file,
            proxy_service=proxy_service,
        ),
        port=port,
    )

plan_cutover is deliberately free of side effects: its only I/O is reading the state file (the same one step 3's planner reads). It decides the target color and packages everything else as data on a frozen dataclass. The test test_plan_cutover_does_not_touch_filesystem asserts that calling plan_cutover does not create the upstream snippet; that side effect belongs in execute_cutover, so any caller can inspect what is about to happen before it happens.

Executing the cutover in the right order

The executor is small but every line matters:

def execute_cutover(
    plan: CutoverPlan,
    runner: Runner = default_runner,
    state_path: Path = DEFAULT_STATE_PATH,
) -> int:
    plan.upstream_path.parent.mkdir(parents=True, exist_ok=True)
    plan.upstream_path.write_text(plan.upstream_body, encoding="utf-8")
    rc = runner(plan.reload_command)
    if rc != 0:
        return rc
    write_active_color(plan.to_color, state_path)
    return 0

Three ordering rules are encoded in those seven executable lines:

  1. Write the snippet first, then reload. If an out-of-band operator runs nginx -s reload from another shell, or the container's own kill -HUP fires for any other reason, the on-disk snippet must already be the new value. Writing first means the worst case is that nginx re-reads the same config we were about to ask it to. Reloading first would mean nginx might re-read the old config, and we would have wasted the reload. The test test_execute_cutover_writes_upstream_then_reloads_then_persists_state enforces this by recording the file contents at the moment the runner fires and asserting it already matches the target color.
  2. Persist the new color only if the reload returned zero. If nginx rejects the new config (typo, missing upstream, network unreachable), the script returns the non-zero exit code without advancing state/active-color. The next plan_cutover will therefore propose the same target again, which is exactly the behaviour an operator wants while they investigate. The test test_execute_cutover_leaves_state_unchanged_when_reload_fails pins this.
  3. The runner's exit code is the executor's exit code. Step 3 already established this convention for execute_plan; step 4's execute_cutover mirrors it. test_execute_cutover_propagates_arbitrary_runner_exit_code makes sure no future "helpful" wrapper accidentally swallows a non-zero return.
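The three rules above can be exercised with a fake runner that records what the snippet looks like at the moment the reload fires. This is a simplified sketch rather than the repository's executor: execute_cutover_sketch takes plain arguments instead of a CutoverPlan, and the newline-terminated state file format is an assumption.

```python
import tempfile
from pathlib import Path
from typing import Callable

Runner = Callable[[tuple[str, ...]], int]


def execute_cutover_sketch(upstream: Path, body: str, state: Path,
                           to_color: str, runner: Runner) -> int:
    upstream.write_text(body, encoding="utf-8")   # rule 1: write first
    rc = runner(("nginx", "-s", "reload"))
    if rc != 0:
        return rc                                 # rules 2 and 3
    state.write_text(to_color + "\n", encoding="utf-8")
    return 0


def run_scenario(reload_rc: int) -> tuple[int, str, str]:
    """Return (exit code, snippet seen at reload time, final state)."""
    seen: list[str] = []
    with tempfile.TemporaryDirectory() as d:
        upstream = Path(d) / "upstream.conf"
        state = Path(d) / "active-color"
        upstream.write_text("server app:8000;\n", encoding="utf-8")
        state.write_text("blue\n", encoding="utf-8")

        def runner(argv: tuple[str, ...]) -> int:
            seen.append(upstream.read_text(encoding="utf-8"))
            return reload_rc

        rc = execute_cutover_sketch(
            upstream, "server app_green:8000;\n", state, "green", runner
        )
        return rc, seen[0], state.read_text(encoding="utf-8").strip()


print(run_scenario(0))
print(run_scenario(7))
```

In both scenarios the runner already sees the green snippet; only the successful one advances the state file to green, and the failing one hands back the runner's 7 unchanged.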

A round-trip test — test_round_trip_two_cutovers_swap_color_twice — runs two cutovers in sequence against a temp directory and asserts the snippet ends up back at server app:8000; with state pointing at blue. That single test catches the whole class of "I forgot to update one of state or snippet" bugs, because if either side desyncs after the first cutover, the second cutover targets the wrong color and the assertions fail.
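Why the round trip catches a desync can be seen in a tiny in-memory model. Everything here is illustrative: the dict stands in for the state file plus the snippet, and the persist_state switch simulates the bug of forgetting to update state.

```python
def next_color(current: str) -> str:
    return "green" if current == "blue" else "blue"


def directive(color: str) -> str:
    service = "app" if color == "blue" else "app_green"
    return f"server {service}:8000;\n"


def cutover(world: dict[str, str], persist_state: bool = True) -> dict[str, str]:
    """Model one successful cutover as a pure function over (state, snippet)."""
    target = next_color(world["state"])
    return {
        "state": target if persist_state else world["state"],
        "snippet": directive(target),
    }


baseline = {"state": "blue", "snippet": directive("blue")}
round_trip = cutover(cutover(baseline))
# Simulated bug: the first cutover forgets to persist the new color.
desynced = cutover(cutover(baseline, persist_state=False))

print(round_trip == baseline)
print(desynced == baseline)
```

With the bug in place, the second cutover targets green again, so the snippet never returns to server app:8000; and the comparison against the baseline fails.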

Wiring the new step into main

The CLI entrypoint that step 3 introduced grows two more lines — bring the new color up, then cut traffic over:

def main(argv: Iterable[str] | None = None) -> int:
    plan = plan_next_rollout()
    print(plan.describe())
    rc = execute_plan(plan)
    if rc != 0:
        return rc
    cutover = plan_cutover()
    print(cutover.describe())
    return execute_cutover(cutover)

This is the smallest possible composition: step 3's planner brings the new replica up; step 4's planner moves traffic to it. Each phase prints one human-readable summary line before it acts, so an operator following the deploy in CI logs can see exactly which compose command and which nginx reload command were about to run. If anything goes wrong, they can copy-paste either line into a shell and re-run it by hand.

Test it

.venv/bin/python -m pytest tests/
........................................................................ [ 80%]
..................                                                       [100%]
90 passed in 2.68s

Twenty-seven new tests on top of the sixty-three from steps 1–3: twenty-three around the cutover planner and executor in test_rollout.py, three around the include seam and baseline snippet in test_nginx_config.py, and one around the conf.d mount in test_compose_config.py. The whole suite still runs in under three seconds because nothing in the new code talks to a running nginx — the Runner seam from step 3 keeps every reload path stub-friendly.

What we got

A reverse proxy that can swap which app replica it sends traffic to in the time it takes nginx to fork a new worker pool: no restart, no socket churn, no in-flight request lost. The mechanism is one include directive in nginx.conf, one tiny snippet file the rollout script owns, and a SIGHUP. The cutover planner is pure data: plan_cutover reads the current color, computes the target color, and packages the upstream body plus the reload argv on a frozen CutoverPlan without writing anything to disk. The executor performs the three side effects in exactly the order that makes a partial failure harmless: write the new snippet, reload nginx, persist the new color only if the reload succeeded. A failed reload leaves the on-disk state pointing at whatever the proxy is actually serving, so the operator never has to reconcile bookkeeping with reality by hand. With step 3's "bring green up alongside blue" invariant and step 4's "swap traffic atomically" invariant both pinned by tests, the rest of the series can stop worrying about the deploy mechanism and start worrying about the deploy policy, which is exactly what step 5's smoke-test-and-auto-rollback layer is for.

Repository

The companion code for this article: https://github.com/vytharion/docker-rollout-zero-downtime-compose

The state of the code after this step: 287d5fa

Key commits to step through:

  • 3c9de0a — step 1: bootstrap a baseline docker-compose stack with app and reverse proxy
  • f69b280 — step 2: add healthcheck and readiness probe to the app service
  • 0084713 — step 3: write a blue-green rollout script that spins up the green container alongside blue
  • 287d5fa — step 4: switch traffic atomically at the reverse proxy from blue to green

What we're doing this step

Step 4 left the stack in a state where the proxy can swap upstreams atomically — but "atomic swap" is a property of the mechanism, not of the deploy as a whole. The container can pass its readiness probe, nginx can fork a fresh worker pool against the new upstream, and the on-disk state file can roll forward to green — and the new replica can still be returning HTTP 500 to every real request because, say, the migration the new image relies on never ran. Step 5 closes that gap by adding a deliberately small post-cutover verification stage. After the new color is live behind the proxy, the rollout script issues a configurable HTTP probe against the public endpoint and waits for it to return the expected status. If the probe never matches inside the configured attempt budget, the script reverses the cutover: it rewrites the upstream snippet back to the previous color, asks nginx to reload, and only updates the on-disk active-color when nginx reports success. The exit code distinguishes "deploy succeeded" (0), "deploy aborted, previous color restored cleanly" (2), and "something broke that needs a human" (everything else), so a CI job can tell the three apart without parsing logs.

The reason this verification belongs in its own step, rather than being folded into step 4, is that the smoke-test-and-rollback layer is policy, not mechanism. Step 4 owes the rest of the series the invariant that the swap itself does not drop requests. Step 5 owes a different invariant: a broken new replica never stays in front of users. Those two invariants are tested in completely different ways. Step 4's tests care about argv shape and the ordering of file writes vs reload commands. Step 5's tests care about retry counts, timeouts, what happens when the probe can't even connect, and what happens when the rollback itself fails. Mixing them in one commit would force a single test file to assert on too many unrelated things at once, and would make the diff that introduces "auto rollback on smoke failure" larger than it needs to be. Keeping them separate gives the operator a clean revert target if they ever decide the smoke layer is too aggressive for their environment: just back out step 5 and the rest of the rollout still works exactly the same way.

Setup

One file in codebase/ grows new dataclasses, helpers, and an orchestrator function; the test file grows a new region of twenty-seven cases that pin step-5 behaviour. No other files are touched — the smoke layer is purely additive on top of the planner from steps 3 and 4.

codebase/
├── scripts/
│   ├── rollout.py            # add SmokePlan, plan_smoke_test, run_smoke_test,
│   │                         # plan_rollback, execute_rollback, execute_rollout,
│   │                         # parse_args, and an updated main()
│   └── ...
└── tests/
    └── test_rollout.py       # twenty-seven new cases for the smoke + rollback layer

No new runtime or test dependencies. The probe uses urllib.request from the standard library so the deploy script keeps its zero-dependency footprint, and the time/retry machinery is split into a Prober seam and a Sleeper seam so unit tests can drive every retry path in microseconds without ever opening a socket or calling time.sleep. Configuration is read from environment variables (ROLLOUT_SMOKE_URL, ROLLOUT_SMOKE_STATUS, ROLLOUT_SMOKE_ATTEMPTS, ROLLOUT_SMOKE_TIMEOUT, ROLLOUT_SMOKE_DELAY) with CLI flags layered on top via argparse, which keeps the same code reusable from a developer's shell and from a CI workflow file.

Implementation

A frozen plan for the smoke probe

Following the same shape established in steps 3 and 4, the smoke layer starts from a pure-data plan. Everything the probe needs — the URL to hit, the status that counts as healthy, the retry budget, the per-attempt timeout, and the delay between attempts — lives on one frozen dataclass:

@dataclass(frozen=True)
class SmokePlan:
    probe_url: str
    expected_status: int
    attempts: int
    timeout_seconds: float
    attempt_delay_seconds: float

    def describe(self) -> str:
        return (
            f"smoke url={self.probe_url} expect={self.expected_status} "
            f"attempts={self.attempts} timeout={self.timeout_seconds}s "
            f"delay={self.attempt_delay_seconds}s"
        )

The builder applies conservative lower bounds before returning the plan. A zero-attempt smoke would skip probing entirely, and because run_smoke_test returns False when no probe matches, every deploy would be rolled back without ever being checked; a zero or negative timeout is not a usable socket timeout, so every probe would fail before it reached the server; a negative inter-attempt delay would propagate straight into time.sleep, which raises ValueError for negative values. Clamping in one place means every downstream caller (CLI, environment variable, library call) gets the same safety net:

def plan_smoke_test(
    probe_url: str = DEFAULT_SMOKE_URL,
    expected_status: int = DEFAULT_SMOKE_STATUS,
    attempts: int = DEFAULT_SMOKE_ATTEMPTS,
    timeout_seconds: float = DEFAULT_SMOKE_TIMEOUT,
    attempt_delay_seconds: float = DEFAULT_SMOKE_DELAY,
) -> SmokePlan:
    return SmokePlan(
        probe_url=probe_url,
        expected_status=expected_status,
        attempts=max(1, attempts),
        timeout_seconds=max(0.001, timeout_seconds),
        attempt_delay_seconds=max(0.0, attempt_delay_seconds),
    )

Three tiny tests in test_rollout.py (test_plan_smoke_test_clamps_attempts_to_at_least_one, test_plan_smoke_test_clamps_negative_timeout, test_plan_smoke_test_clamps_negative_delay) pin those clamps so a future "let me just remove this max() call" refactor breaks the suite loudly.
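To see the clamps in action, here is a self-contained sketch. The default values for the URL, status, attempts, timeout, and delay are placeholders; the real DEFAULT_SMOKE_* constants are not shown in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SmokePlan:
    probe_url: str
    expected_status: int
    attempts: int
    timeout_seconds: float
    attempt_delay_seconds: float


def plan_smoke_test(
    probe_url: str = "http://localhost/health",  # placeholder default
    expected_status: int = 200,                  # placeholder default
    attempts: int = 5,                           # placeholder default
    timeout_seconds: float = 2.0,                # placeholder default
    attempt_delay_seconds: float = 1.0,          # placeholder default
) -> SmokePlan:
    return SmokePlan(
        probe_url=probe_url,
        expected_status=expected_status,
        attempts=max(1, attempts),
        timeout_seconds=max(0.001, timeout_seconds),
        attempt_delay_seconds=max(0.0, attempt_delay_seconds),
    )


hostile = plan_smoke_test(attempts=0, timeout_seconds=-1.0, attempt_delay_seconds=-3.0)
print(hostile.attempts, hostile.timeout_seconds, hostile.attempt_delay_seconds)
```

All three out-of-range inputs are pulled up to their floors: one attempt, a 0.001 second timeout, and no delay.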

Probing without opening a socket

The probe itself is a one-function HTTP client that collapses every failure mode into "did we see the expected status or not". The default prober uses urllib.request and treats any exception other than an HTTP error response as status 0:

def default_prober(url: str, timeout: float) -> int:
    req = urllib.request.Request(
        url, headers={"User-Agent": "rollout-smoke/1.0"}
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return int(resp.status)
    except urllib.error.HTTPError as exc:
        return int(exc.code)
    except Exception:
        return 0

Two design choices matter. First, an HTTPError (a real HTTP 4xx/5xx) is unwrapped to its numeric code rather than collapsed to zero. That way someone running a deploy that legitimately expects, say, a 403 from a private endpoint can configure the expected status and have the smoke pass without changes (a 2xx such as 204 never raises and already arrives through the normal return path). Second, every other exception (connection refused, DNS failure, TLS error, timeout) becomes a 0, which the retry loop treats identically to a non-matching HTTP status. The caller never has to distinguish "the proxy returned 502" from "we could not even open a TCP socket to the proxy"; both mean "the new replica is not serving what we asked for, try again". The test test_default_prober_returns_zero_when_connection_refused monkeypatches urlopen to raise an OSError and asserts the function returns 0 instead of crashing.
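Both branches can be driven without a network by patching urlopen. The sketch below uses unittest.mock rather than pytest's monkeypatch so it runs on its own; the probe_with helper and the proxy.invalid URL are illustrative only.

```python
import urllib.error
import urllib.request
from email.message import Message
from unittest import mock


def default_prober(url: str, timeout: float) -> int:
    req = urllib.request.Request(url, headers={"User-Agent": "rollout-smoke/1.0"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return int(resp.status)
    except urllib.error.HTTPError as exc:
        return int(exc.code)
    except Exception:
        return 0


def probe_with(side_effect: BaseException) -> int:
    """Run default_prober with urlopen patched to raise side_effect."""
    with mock.patch("urllib.request.urlopen", side_effect=side_effect):
        return default_prober("http://proxy.invalid/", 0.1)


refused = probe_with(ConnectionRefusedError("connection refused"))
forbidden = probe_with(
    urllib.error.HTTPError("http://proxy.invalid/", 403, "Forbidden", Message(), None)
)
print(refused, forbidden)
```

The refused connection collapses to 0, while the HTTP error keeps its 403.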

The retry loop wraps the prober and is the only place in step 5 that actually iterates:

def run_smoke_test(
    plan: SmokePlan,
    prober: Prober = default_prober,
    sleeper: Sleeper = time.sleep,
) -> bool:
    last_index = plan.attempts - 1
    for attempt in range(plan.attempts):
        status = prober(plan.probe_url, plan.timeout_seconds)
        if status == plan.expected_status:
            return True
        if attempt < last_index:
            sleeper(plan.attempt_delay_seconds)
    return False

The interesting line is the if attempt < last_index: guard. After the final failed attempt there is no point waiting: the next thing the orchestrator does is start the rollback, and an extra attempt_delay_seconds of sleep would just slow that down. test_run_smoke_test_does_not_sleep_after_last_attempt records every sleep duration and asserts the list ends without a trailing one, locking that behaviour in. A short companion test, test_run_smoke_test_returns_true_on_first_matching_probe, asserts the sleeper is never invoked at all on the happy path; the function returns the instant the first probe matches, so a healthy deploy pays no retry cost.
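The sleep accounting is easy to observe with a scripted prober and a list standing in for the sleeper. This sketch flattens SmokePlan into plain arguments to stay short; scripted_prober and simulate are hypothetical helpers, and the Prober and Sleeper aliases are assumed shapes for the seams named earlier.

```python
from typing import Callable

Prober = Callable[[str, float], int]   # assumed shape of the Prober seam
Sleeper = Callable[[float], None]      # assumed shape of the Sleeper seam


def run_smoke_test(url: str, expected: int, attempts: int, timeout: float,
                   delay: float, prober: Prober, sleeper: Sleeper) -> bool:
    last_index = attempts - 1
    for attempt in range(attempts):
        if prober(url, timeout) == expected:
            return True
        if attempt < last_index:
            sleeper(delay)
    return False


def scripted_prober(statuses: list[int]) -> Prober:
    remaining = iter(statuses)

    def prober(url: str, timeout: float) -> int:
        return next(remaining)

    return prober


def simulate(statuses: list[int], attempts: int = 3) -> tuple[bool, list[float]]:
    sleeps: list[float] = []
    ok = run_smoke_test("http://proxy.invalid/", 200, attempts, 0.1, 0.5,
                        scripted_prober(statuses), sleeps.append)
    return ok, sleeps


print(simulate([200]))
print(simulate([503, 0, 200]))
print(simulate([503, 503, 503]))
```

Three attempts produce at most two sleeps, and a first-try match produces none.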

Inverting a cutover into a rollback

The rollback is deliberately implemented as "do a normal cutover, but pointing the other way". The plan_rollback helper takes the CutoverPlan that was just executed and returns a new CutoverPlan whose from_color and to_color are swapped, whose upstream_body points at the original color again, and whose upstream_path and reload_command are unchanged:

def plan_rollback(cutover: CutoverPlan) -> CutoverPlan:
    return CutoverPlan(
        from_color=cutover.to_color,
        to_color=cutover.from_color,
        upstream_path=cutover.upstream_path,
        upstream_body=upstream_directive_for_color(
            cutover.from_color, port=cutover.port
        ),
        reload_command=cutover.reload_command,
        port=cutover.port,
    )

The reason for going through CutoverPlan rather than inventing a separate RollbackPlan type is that every safety property step 4 painstakingly pinned for cutover — write the snippet before reloading, only persist state when the reload returns zero, propagate the runner's exit code unchanged — applies to rollback for free. execute_rollback is a one-liner that delegates straight back to execute_cutover:

def execute_rollback(
    cutover: CutoverPlan,
    runner: Runner = default_runner,
    state_path: Path = DEFAULT_STATE_PATH,
) -> int:
    rollback = plan_rollback(cutover)
    return execute_cutover(rollback, runner=runner, state_path=state_path)

The test test_execute_rollback_does_not_advance_state_when_reload_fails is the one that actually validates this approach pays off. It runs a normal cutover (state → green), then runs the rollback against a runner that returns 5, and asserts the exit code is 5 and the state file still says green. That second assertion is the load-bearing one: if the rollback's nginx reload failed, the proxy is still serving green, and the bookkeeping had better agree with reality. Because execute_rollback is built on top of execute_cutover, the if rc != 0: return rc branch from step 4 protects rollback as well — the operator can rerun the deploy, see the same problem, and never lose track of which color the proxy is actually pointing at.
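One property that falls out of this design is that inverting twice gets you back where you started. The sketch below re-declares a trimmed CutoverPlan (without describe) and a stand-in upstream_directive_for_color so it runs on its own; the snippet path and the short reload argv are illustrative values.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class CutoverPlan:
    from_color: str
    to_color: str
    upstream_path: Path
    upstream_body: str
    reload_command: tuple[str, ...]
    port: int = 8000


def upstream_directive_for_color(color: str, port: int = 8000) -> str:
    service = "app" if color == "blue" else "app_green"  # assumed mapping
    return f"server {service}:{port};\n"


def plan_rollback(cutover: CutoverPlan) -> CutoverPlan:
    return CutoverPlan(
        from_color=cutover.to_color,
        to_color=cutover.from_color,
        upstream_path=cutover.upstream_path,
        upstream_body=upstream_directive_for_color(cutover.from_color, port=cutover.port),
        reload_command=cutover.reload_command,
        port=cutover.port,
    )


forward = CutoverPlan(
    from_color="blue",
    to_color="green",
    upstream_path=Path("nginx/conf.d/upstream.conf"),
    upstream_body=upstream_directive_for_color("green"),
    reload_command=("nginx", "-s", "reload"),
)
rollback = plan_rollback(forward)
print(rollback.to_color, repr(rollback.upstream_body))
print(plan_rollback(rollback) == forward)
```

Because the dataclass is frozen and compares by value, the double inversion is equal to the original plan field for field.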

The orchestrator: bring up, cut over, smoke, maybe roll back

execute_rollout ties all four pieces together. It is short by design — every interesting decision happens inside the helpers it composes — but its exit-code contract is the public face of the deploy:

def execute_rollout(
    plan: RolloutPlan,
    cutover: CutoverPlan,
    smoke: SmokePlan,
    runner: Runner = default_runner,
    prober: Prober = default_prober,
    sleeper: Sleeper = time.sleep,
    state_path: Path = DEFAULT_STATE_PATH,
) -> int:
    rc = execute_plan(plan, runner=runner)
    if rc != 0:
        return rc
    rc = execute_cutover(cutover, runner=runner, state_path=state_path)
    if rc != 0:
        return rc
    if run_smoke_test(smoke, prober=prober, sleeper=sleeper):
        return 0
    rb_rc = execute_rollback(cutover, runner=runner, state_path=state_path)
    if rb_rc != 0:
        return rb_rc
    return ROLLED_BACK_EXIT_CODE

Three exit codes the rest of the toolchain can rely on:

  1. 0 — compose-up succeeded, reload succeeded, smoke passed. The new color is live and serving real traffic.
  2. ROLLED_BACK_EXIT_CODE (defined as 2) — compose-up and reload succeeded, smoke failed, and the rollback's reload also succeeded. The previous color is back in front of users and the on-disk state agrees. A CI job seeing this can fail the build loudly but does not need to wake a human at 2 AM.
  3. Anything else — a runner returned non-zero somewhere along the way. The state file may or may not have advanced; an operator must inspect. The exact code is whichever non-zero a Runner last returned, so the failure mode is preserved end-to-end.
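A CI wrapper could map the three outcomes above to labels along these lines. This is a hypothetical sketch: classify_exit_code and the label strings are not part of the rollout script.

```python
ROLLED_BACK_EXIT_CODE = 2


def classify_exit_code(rc: int) -> str:
    """Map the rollout's exit code to one of three CI outcomes (sketch)."""
    if rc == 0:
        return "deployed"
    if rc == ROLLED_BACK_EXIT_CODE:
        return "rolled-back"
    return "needs-operator"


for code in (0, 2, 1, 137):
    print(code, classify_exit_code(code))
```

One caveat follows from the orchestrator as written: a runner that itself exits with 2 is passed through unchanged, so a wrapper like this cannot tell that case apart from a clean rollback.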

Four tests pin the contract from four different angles. test_execute_rollout_happy_path_swaps_state_to_new_color runs the whole pipeline against a runner that always returns zero and a prober that always returns 200, and asserts the state file ends at green. test_execute_rollout_returns_rollback_code_when_smoke_fails swaps the prober for one returning 503 and asserts the exit code is ROLLED_BACK_EXIT_CODE, the state file is back at blue, and the upstream snippet points at app again. test_execute_rollout_runs_smoke_after_cutover records the order of every runner/prober invocation and asserts it is ["runner", "runner", "probe"] — compose-up, then nginx reload, then smoke, never the other way around. test_execute_rollout_skips_smoke_and_rollback_when_compose_up_fails and its reload-fails sibling check the early-exit branches: if a step before smoke fails, the probe must not fire and the rollback must not run, because there is nothing to roll back from.

Wiring the CLI

The main entrypoint from step 4 grows three lines to parse the new flags, build the SmokePlan, and hand all three plans to execute_rollout:

def main(argv: Iterable[str] | None = None) -> int:
    argv_list = None if argv is None else list(argv)
    args = parse_args(argv_list)
    plan = plan_next_rollout()
    cutover = plan_cutover()
    smoke = plan_smoke_test(
        probe_url=args.smoke_url,
        expected_status=args.smoke_status,
        attempts=args.smoke_attempts,
        timeout_seconds=args.smoke_timeout,
        attempt_delay_seconds=args.smoke_delay,
    )
    print(plan.describe())
    print(cutover.describe())
    print(smoke.describe())
    return execute_rollout(plan, cutover, smoke)

Five flags surface the smoke knobs: --smoke-url, --smoke-status, --smoke-attempts, --smoke-timeout, --smoke-delay. Each defaults to a module-level constant that itself reads from an environment variable, so the same script can be tuned three different ways — code default, env var, or CLI flag — without ever editing the file. test_parse_args_defaults_match_module_defaults and test_parse_args_supports_smoke_status_attempts_timeout_delay make sure that ladder stays consistent.
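The code default → env var → CLI flag ladder might be sketched like this. It deviates from the description above in one respect, purely for testability: the environment is passed in as a mapping instead of being read into module-level constants. The fallback values are placeholders.

```python
import argparse
import os
from typing import Mapping, Optional


def build_parser(env: Optional[Mapping[str, str]] = None) -> argparse.ArgumentParser:
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(prog="rollout")
    # Each default is the env var if set, else a placeholder code default.
    parser.add_argument("--smoke-url",
                        default=env.get("ROLLOUT_SMOKE_URL", "http://localhost/health"))
    parser.add_argument("--smoke-status", type=int,
                        default=int(env.get("ROLLOUT_SMOKE_STATUS", "200")))
    parser.add_argument("--smoke-attempts", type=int,
                        default=int(env.get("ROLLOUT_SMOKE_ATTEMPTS", "5")))
    parser.add_argument("--smoke-timeout", type=float,
                        default=float(env.get("ROLLOUT_SMOKE_TIMEOUT", "2.0")))
    parser.add_argument("--smoke-delay", type=float,
                        default=float(env.get("ROLLOUT_SMOKE_DELAY", "1.0")))
    return parser


code_default = build_parser({}).parse_args([])
from_env = build_parser({"ROLLOUT_SMOKE_STATUS": "204"}).parse_args([])
from_flag = build_parser({"ROLLOUT_SMOKE_STATUS": "204"}).parse_args(
    ["--smoke-status", "403"]
)
print(code_default.smoke_status, from_env.smoke_status, from_flag.smoke_status)
```

The flag wins over the environment variable, which in turn wins over the code default.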

Test it

.venv/bin/python -m pytest tests/
........................................................................ [ 61%]
.............................................                           [100%]
117 passed in 2.49s

Twenty-seven new tests on top of the ninety from steps 1–4: every clamp inside plan_smoke_test, every branch inside run_smoke_test (happy first-try, retries-then-match, all-attempts-fail, no-trailing-sleep, zero-status-treated-as-failure), the default_prober failure path, every rollback ordering invariant, every exit-code branch in execute_rollout, and the argparse plumbing. The whole suite still finishes in under three seconds because no test in the new region opens a socket, calls time.sleep, or shells out — the Runner, Prober, and Sleeper seams keep every interesting path stub-friendly.

What we got

A rollout script that no longer trusts the container healthcheck to be the final word on whether a deploy worked. After step 4's atomic cutover, the script issues a configurable HTTP probe against the live endpoint, retries it on a budget, and treats any failure-to-match (HTTP error, timeout, connection refused, anything) as a deploy verdict of "no". If the verdict is no, the script reverses the cutover automatically — rewriting the upstream snippet back to the previous color and asking nginx to reload — and exits with a dedicated rollback exit code so CI jobs can distinguish "safely aborted" from "something genuinely broke". The rollback path reuses the cutover executor unchanged, which means every safety property step 4 pinned (write-before-reload ordering, no-state-update on reload failure, exit-code propagation) protects rollback for free. The SmokePlan/Prober/Sleeper seams keep the suite Docker-free and socket-free, so a CI runner without a daemon can still validate the whole pipeline end-to-end in under three seconds. With this layer in place, the deploy story is complete: bring the new color up alongside the old one, swap traffic atomically, probe the result, and undo the swap automatically if the probe disagrees — no partial state, no manual reconciliation, no broken replica left in front of users while an operator hunts for the right docker compose invocation.

Repository

The companion code for this article: https://github.com/vytharion/docker-rollout-zero-downtime-compose

The state of the code after this step: ca5bbdd

Key commits to step through:

  • 3c9de0a — step 1: bootstrap a baseline docker-compose stack with app and reverse proxy
  • f69b280 — step 2: add healthcheck and readiness probe to the app service
  • 0084713 — step 3: write a blue-green rollout script that spins up the green container alongside blue
  • 287d5fa — step 4: switch traffic atomically at the reverse proxy from blue to green
  • ca5bbdd — step 5: add post-cutover smoke tests with automatic rollback on failure

What we're doing this step

Steps 1 through 5 built up a complete zero-downtime deploy: a baseline compose stack with a reverse proxy, a healthcheck the proxy waits on, a blue/green planner that brings the new color up alongside the old one, an atomic upstream swap with nginx reload, and a post-cutover smoke test that auto-rolls-back when the new replica fails to answer. Each of those pieces is reachable today, but reaching them means typing the right Python module path, exporting half a dozen environment variables, and remembering the order init, then up, then python -m scripts.rollout, which is exactly the kind of operational knowledge that lives in one engineer's head until they leave the team. Step 6 packages the whole pipeline behind one stable entry point: make deploy. The Makefile is the seam where a CI workflow file and an operator's SSH session agree on the same shell-free, TTY-free, argument-explicit invocation. After this step, the CI YAML can shrink to run: make deploy and an operator chasing a hotfix can type ssh prod make -C /opt/svc deploy SMOKE_URL=https://app.example.com/healthz from a laptop with no extra environment setup.

Three properties drive every design decision in this step. First, the recipe must be non-interactive: CI runners and SSH sessions have no TTY, so a stray docker compose exec -it would fail outright or hang a deploy in the dark, depending on the Compose version. Second, the recipe must be idempotent enough to retry: a flaky network during cutover should not leave the proxy half-swapped and the Makefile unable to recover, which means the only state the recipe owns lives in files the rollout module already reconciles. Third, the override surface must be flag-shaped, not script-shaped: a CI job tuning the smoke probe should be able to set SMOKE_URL=... inline without sed-editing the Makefile, and an SSH operator should be able to point at an alternate compose project with PROJECT_NAME=ci-7 without remembering an undocumented env var. The Makefile in this step is small (under a hundred lines including help) because it borrows every interesting decision from the modules behind it; its job is to give those modules one obvious door.

Setup

Two files at the codebase root: a new Makefile next to the existing docker-compose.yml, and a new tests/test_makefile.py next to the rest of the suite. No production code in scripts/ is touched — step 6 is purely a packaging layer over the planner and the smoke runner from steps 3 through 5.

codebase/
├── Makefile                 # new: help / init / build / up / down /
│                            #      deploy / status / logs / test / clean
├── docker-compose.yml       # unchanged from step 5
├── nginx/
├── scripts/
│   └── rollout.py           # unchanged: the deploy recipe is a thin wrapper
└── tests/
    └── test_makefile.py     # new: 19 dry-run assertions on the recipe shape

No new runtime dependencies. The Makefile is plain GNU make — no recursive includes, no shell helpers, no eval. The test file shells out to make -n (the dry-run flag) which prints every command make would execute without running it, then asserts on the resulting text. That keeps the suite docker-free and socket-free, the same discipline the rollout tests follow, so the whole step still validates in well under a second on a CI runner with no Docker daemon available. A single pytest.mark.skipif(shutil.which("make") is None, ...) guard at the top of the test module quietly skips the whole file on the rare host without GNU make installed instead of failing loudly.
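The dry-run harness described above can be sketched in a few lines. This is a minimal illustration, not the repository's actual helper: the function names dry_run_argv and dry_run are hypothetical, and the --no-print-directory flag is added here only to keep GNU make's "Entering directory" banner out of the captured text.

```python
import shutil
import subprocess

# Feeds a module-level guard such as:
#   pytestmark = pytest.mark.skipif(not MAKE_AVAILABLE, reason="make not installed")
MAKE_AVAILABLE = shutil.which("make") is not None


def dry_run_argv(root, target, **overrides):
    """Build the argv for `make -n`, passing overrides as VAR=value arguments."""
    argv = ["make", "-n", "--no-print-directory", "-C", str(root), target]
    argv += [f"{name}={value}" for name, value in sorted(overrides.items())]
    return argv


def dry_run(root, target, **overrides):
    """Return the commands make would run for `target`, without running them."""
    result = subprocess.run(
        dry_run_argv(root, target, **overrides),
        capture_output=True,
        text=True,
        check=True,  # a Makefile syntax error should fail the test loudly
    )
    return result.stdout
```

A test then reduces to something like assert "scripts.rollout" in dry_run(root, "deploy"), with overrides passed as keyword arguments.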

Implementation

Anchoring paths to the Makefile, not the caller

The very first non-comment line in the recipe section pins where everything else resolves from:

ROOT := $(abspath $(dir $(lastword $(MAKEFILE_LIST))))

$(MAKEFILE_LIST) is a built-in that holds every Makefile make has parsed so far; $(lastword ...) plucks the one currently being read, $(dir ...) strips the filename, and $(abspath ...) turns the directory into an absolute path. The result is that ROOT always points at the directory containing this Makefile, no matter where the caller was when they invoked make. That single line is what lets ssh host make -C /opt/svc deploy, or make -f /opt/svc/Makefile deploy from an unrelated directory, behave identically to a local make deploy: -C changes make's working directory before the Makefile is read and -f does not change it at all, but every path the recipe touches (compose file, state file, upstream snippet) flows from ROOT, not from $(CURDIR). The test test_makefile_anchors_paths_with_abspath_of_makefile_list reads the Makefile body and asserts both $(abspath and MAKEFILE_LIST appear in it, so a future refactor that "simplifies" the line away will trip a unit test before it ships.
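To make the three nested functions concrete, here is a rough Python rendering of what the ROOT line computes, next to the structural check the test is described as performing. Both function names are illustrative and do not come from the repository.

```python
import os.path


def makefile_root(makefile_list: str) -> str:
    """Python rendering of $(abspath $(dir $(lastword $(MAKEFILE_LIST))))."""
    last = makefile_list.split()[-1]          # $(lastword ...)
    directory = os.path.dirname(last) or "."  # $(dir ...); a bare name means "./"
    return os.path.abspath(directory)         # $(abspath ...), symlinks untouched


def anchors_paths(makefile_text: str) -> bool:
    """The check described in the text: both markers must appear in the file."""
    return "$(abspath" in makefile_text and "MAKEFILE_LIST" in makefile_text
```

For example, makefile_root("/opt/svc/Makefile") yields /opt/svc on a POSIX host regardless of the current directory, which is the property the recipe relies on.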

Variables an external caller can override

Every knob a CI job or an operator might want to change is declared with ?=, which makes the assignment a default that an environment variable or a VAR=value argument can override:

PROJECT_NAME ?= rollout
COMPOSE_FILE ?= $(ROOT)/docker-compose.yml
COMPOSE := docker compose --project-name $(PROJECT_NAME) -f $(COMPOSE_FILE)

PYTHON ?= python3
ROLLOUT_MODULE := scripts.rollout

STATE_DIR ?= $(ROOT)/state
STATE_FILE ?= $(STATE_DIR)/active-color
UPSTREAM_CONF ?= $(ROOT)/nginx/conf.d/upstream.conf

PROJECT_NAME flows into --project-name so several deploys can share a host without colliding container names. COMPOSE_FILE lets a caller swap in an alternate compose file (for example, a CI job that injects an image tag override via a separate file). PYTHON is the one that matters most for SSH reuse: many production hosts deliberately omit python3 from the default PATH and only expose a versioned interpreter like /opt/python/bin/python3.11. Hard-coding python3 in the recipe would force every such host to symlink, alias, or wrap the call; making it a ?= variable means make deploy PYTHON=/opt/python/bin/python3.11 Just Works. The test test_deploy_uses_overridable_python_interpreter exercises that path explicitly.
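The precedence behind ?= is easy to misremember, so here is a small model of it. It is a sketch of GNU make's documented behaviour for a NAME ?= default assignment (with no -e flag in play), and the resolve function is purely illustrative.

```python
def resolve(name, default, environ, cli):
    """Model how GNU make settles a `NAME ?= default` variable."""
    if name in cli:      # VAR=value on the make command line wins outright
        return cli[name]
    if name in environ:  # an environment variable counts as already defined
        return environ[name]
    return default       # otherwise ?= supplies the default


# make deploy PYTHON=/opt/python/bin/python3.11
interpreter = resolve("PYTHON", "python3", {}, {"PYTHON": "/opt/python/bin/python3.11"})
```

Note that an environment variable set to an empty string still counts as defined, so it suppresses the default rather than falling through to it.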

Conditional smoke flags without empty arguments

The five smoke knobs introduced in step 5 (URL, expected status, attempts, timeout, delay) already have safe defaults inside rollout.py. The Makefile's job is to forward an override when one is present and omit the flag entirely when one isn't. A naive --smoke-url $(SMOKE_URL) would expand to --smoke-url (note the trailing space) when SMOKE_URL is empty, and argparse rejects a value-taking flag that arrives without a value: it prints an "expected one argument" error and exits with status 2 before the rollout starts. The fix is a per-knob ifneq guard:

SMOKE_FLAGS :=
ifneq ($(strip $(SMOKE_URL)),)
SMOKE_FLAGS += --smoke-url $(SMOKE_URL)
endif
ifneq ($(strip $(SMOKE_STATUS)),)
SMOKE_FLAGS += --smoke-status $(SMOKE_STATUS)
endif
ifneq ($(strip $(SMOKE_ATTEMPTS)),)
SMOKE_FLAGS += --smoke-attempts $(SMOKE_ATTEMPTS)
endif
ifneq ($(strip $(SMOKE_TIMEOUT)),)
SMOKE_FLAGS += --smoke-timeout $(SMOKE_TIMEOUT)
endif
ifneq ($(strip $(SMOKE_DELAY)),)
SMOKE_FLAGS += --smoke-delay $(SMOKE_DELAY)
endif

$(strip ...) collapses whitespace so a caller passing SMOKE_URL=" " does not accidentally trip the conditional. Each knob is forwarded independently, so a CI job can tune just the retry budget (make deploy SMOKE_ATTEMPTS=10) without having to also pass a URL and a status it does not want to change. The tests test_deploy_passes_smoke_url_when_overridden, test_deploy_omits_smoke_flag_when_override_absent, and test_deploy_forwards_every_smoke_knob_independently pin all three behaviours: forward when set, omit when unset, and never collapse two unrelated knobs into one flag.
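The same forward-or-omit rule can be written out in Python, which also makes the argparse failure mode easy to reproduce. The flag names are the ones used above; the smoke_flags helper and the stand-in parser are assumptions for illustration, and the real parser in rollout.py may declare types and defaults that are not shown here.

```python
import argparse

# Makefile variable -> rollout flag, in the order the guards appear.
SMOKE_KNOBS = {
    "SMOKE_URL": "--smoke-url",
    "SMOKE_STATUS": "--smoke-status",
    "SMOKE_ATTEMPTS": "--smoke-attempts",
    "SMOKE_TIMEOUT": "--smoke-timeout",
    "SMOKE_DELAY": "--smoke-delay",
}


def smoke_flags(overrides):
    """Forward a knob only when its value is non-empty after stripping."""
    argv = []
    for variable, flag in SMOKE_KNOBS.items():
        value = str(overrides.get(variable, "")).strip()
        if value:
            argv += [flag, value]
    return argv


def build_parser():
    """Stand-in parser: every smoke flag expects exactly one value."""
    parser = argparse.ArgumentParser(prog="rollout")
    for flag in SMOKE_KNOBS.values():
        parser.add_argument(flag)
    return parser


# Only the retry budget is forwarded; the blank URL is dropped.
argv = smoke_flags({"SMOKE_ATTEMPTS": "10", "SMOKE_URL": " "})
```

Feeding the unguarded form to the parser, build_parser().parse_args(["--smoke-url"]), raises SystemExit with code 2, which is the failure the ifneq guards exist to prevent.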

The deploy target itself

With the variables in place, deploy is a single recipe whose body does nothing surprising — every interesting decision already happened upstream:

deploy: init  ## Run a zero-downtime blue/green rollout end-to-end
	cd $(ROOT) && \
		ROLLOUT_PROJECT_NAME=$(PROJECT_NAME) \
		ROLLOUT_COMPOSE_FILE=$(COMPOSE_FILE) \
		ROLLOUT_STATE_PATH=$(STATE_FILE) \
		ROLLOUT_UPSTREAM_CONF=$(UPSTREAM_CONF) \
		PYTHONPATH=$(ROOT) \
		$(PYTHON) -m $(ROLLOUT_MODULE) $(SMOKE_FLAGS)

Four properties are load-bearing. deploy: init declares a prerequisite so the first deploy on a fresh host seeds the state file and the upstream snippet before rollout.py reads them — without that, nginx would boot against an empty include and 502 every request. The exported ROLLOUT_* environment variables are the contract scripts.rollout already reads, so the Makefile does not invent a parallel configuration surface; it just maps its own variable names to the names the script expects. PYTHONPATH=$(ROOT) lets python -m scripts.rollout resolve the package even when the caller has not activated a virtualenv. $(SMOKE_FLAGS) at the end is what makes the override surface composable: a vanilla make deploy calls python -m scripts.rollout with no smoke flags, and any SMOKE_* override silently appears in the argv. The tests test_deploy_target_invokes_rollout_module, test_deploy_depends_on_init, test_deploy_propagates_project_name_to_rollout_env, and test_deploy_propagates_compose_file_to_rollout_env lock every one of those properties.
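For readers who skipped the earlier steps, the receiving end of that environment contract might look roughly like this. The four variable names come from the recipe above; the RolloutEnv class and its fallback values are hypothetical, since the script's real defaults are not shown in this step.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class RolloutEnv:
    """The four settings the deploy recipe passes through the environment."""

    project_name: str
    compose_file: str
    state_path: str
    upstream_conf: str

    @classmethod
    def from_environ(cls, environ=None):
        env = os.environ if environ is None else environ
        # Fallbacks below are assumptions that mirror the Makefile defaults,
        # expressed relative to the working directory.
        return cls(
            project_name=env.get("ROLLOUT_PROJECT_NAME", "rollout"),
            compose_file=env.get("ROLLOUT_COMPOSE_FILE", "docker-compose.yml"),
            state_path=env.get("ROLLOUT_STATE_PATH", "state/active-color"),
            upstream_conf=env.get("ROLLOUT_UPSTREAM_CONF", "nginx/conf.d/upstream.conf"),
        )
```

Accepting an explicit mapping instead of reading os.environ directly keeps the reader testable without touching process state, in the same spirit as the Runner, Prober, and Sleeper seams.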

A separate negative test, test_deploy_does_not_invoke_destructive_compose_commands, asserts the dry-run output of make deploy contains none of compose down, compose stop, compose kill, compose rm, or compose restart. The whole point of steps 3 through 5 is to avoid dropping traffic, and a future refactor that adds, say, a "clean previous color before deploying" line could silently undo all that work; pinning the absence of the destructive verbs makes that mistake a test failure rather than a production incident.
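One subtlety is worth spelling out. Because COMPOSE expands to docker compose --project-name ... -f ..., a destructive verb would land several tokens after the word compose, so a plain substring match on "compose down" could miss it. The sketch below is a token-based variant of the check, offered as an assumption about how such a test could be hardened rather than as the repository's implementation.

```python
DESTRUCTIVE = {"down", "stop", "kill", "rm", "restart"}


def destructive_compose_calls(dry_run_text):
    """Return (verb, line) pairs for compose commands that would drop containers."""
    hits = []
    for line in dry_run_text.splitlines():
        tokens = line.split()
        if "compose" not in tokens:
            continue
        # Look at everything after `compose`, so flags in between do not hide the verb.
        after = tokens[tokens.index("compose") + 1:]
        hits += [(token, line) for token in after if token in DESTRUCTIVE]
    return hits


risky = "docker compose --project-name rollout -f /opt/svc/docker-compose.yml down"
found = destructive_compose_calls(risky)
```

The check is deliberately conservative: a harmless command such as docker compose exec app rm /tmp/x would also be flagged, which is an acceptable trade for a deploy recipe that should never need those words at all.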

help as the default goal, .PHONY for every recipe

Two small but consequential lines near the top of the file:

.DEFAULT_GOAL := help
.PHONY: help init build up down deploy status logs test clean

.DEFAULT_GOAL := help means a bare make over SSH prints the target list instead of silently triggering a deploy. That matters because an operator typing ssh host make (no target) by mistake should see a banner, not a rollout. The help recipe itself is a one-line awk that scans the Makefile for ## doc comments and pretty-prints them, so adding a new target with a ## description suffix surfaces in make help automatically with no additional bookkeeping. The .PHONY declaration protects every target name from being short-circuited by a file or directory of the same name in the working tree: if a path called deploy, build, or test exists next to the Makefile (a build/ output directory is the classic case), make would otherwise treat the matching target as up-to-date and skip the recipe entirely. The test test_makefile_declares_phony_targets walks every .PHONY: line and asserts each of help, init, build, up, down, deploy, and test is present.
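Since the awk one-liner is not reproduced here, the following Python sketch approximates what it extracts, together with the .PHONY walk the test is described as doing. The regular expression and both function names are assumptions made for illustration.

```python
import re

# Matches "target: prerequisites  ## description" at the start of a line.
HELP_LINE = re.compile(r"^([A-Za-z0-9_.-]+):[^#\n]*##\s*(.+)$", re.MULTILINE)


def help_entries(makefile_text):
    """Collect (target, description) pairs from ## doc comments."""
    return [(m.group(1), m.group(2).strip()) for m in HELP_LINE.finditer(makefile_text)]


def phony_targets(makefile_text):
    """Union of every name listed on a .PHONY: line."""
    names = set()
    for line in makefile_text.splitlines():
        if line.startswith(".PHONY:"):
            names.update(line[len(".PHONY:"):].split())
    return names
```

With these two helpers the structural assertion becomes a set comparison, for example {"help", "deploy"} <= phony_targets(text).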

Bootstrap (up) vs. deploy

make up and make deploy look superficially similar — both bring containers up — but they answer different questions. up is the first-time bootstrap: on a fresh host, it seeds the state file to blue, writes a starter upstream snippet that points at the app service, and brings up the blue replica and the proxy. It explicitly does NOT bring up app_green, because green is a deploy-time artefact created by the rollout planner when a new image is being pushed. deploy, by contrast, is what every subsequent release runs: the bootstrap already happened, the active color is whatever the state file says, and the planner creates the inactive color alongside it. Two tests pin the difference: test_up_target_brings_blue_alongside_proxy_only asserts make -n up mentions app and proxy but never app_green, and test_up_depends_on_init asserts the bootstrap recipe also runs the init seeding so nginx never boots against an empty include.

clean is bookkeeping, not teardown

The last target worth highlighting is the one that most easily becomes a footgun:

clean:  ## Remove on-host rollout state (does NOT touch containers)
	rm -rf $(STATE_DIR)
	rm -f $(UPSTREAM_CONF)

clean removes the active-color file and the generated upstream snippet, the two pieces of bookkeeping the rollout owns, and pointedly does not invoke any docker command. An operator who wants to reset the deploy bookkeeping after manually fixing the state mid-incident should be able to do so without taking down live containers. test_clean_target_removes_state_only_not_containers parses the dry-run output, extracts the first token of every command line, and asserts docker is not among the invocations. (The test extracts the leading verb specifically so an absolute path that happens to contain the substring docker somewhere, for example a state directory under a docker workspace, does not trigger a spurious failure.) The teardown verb already exists separately as make down, which is unambiguously destructive and never runs without an explicit operator decision.
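The leading-verb extraction is short enough to show in full. This is a sketch of the idea as described, with a hypothetical function name and a made-up workspace path that happens to contain the word docker.

```python
def leading_verbs(dry_run_text):
    """First token of every non-empty command line in `make -n` output."""
    verbs = []
    for line in dry_run_text.splitlines():
        tokens = line.split()
        if tokens:
            verbs.append(tokens[0])
    return verbs


# Hypothetical dry-run output for a checkout that lives under /srv/docker/.
printed = (
    "rm -rf /srv/docker/svc/state\n"
    "rm -f /srv/docker/svc/nginx/conf.d/upstream.conf\n"
)
verbs = leading_verbs(printed)
```

A substring check would see docker in both lines, while the verb list contains only rm, so the assertion "docker" not in verbs passes for the right reason.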

Test it

.venv/bin/python -m pytest tests/
........................................................................ [ 52%]
................................................................         [100%]
136 passed in 2.92s

Nineteen new tests on top of the one hundred seventeen from steps 1–5, asserting against make -n dry-run output or the Makefile text itself rather than real make execution. The new region covers: every .PHONY target is declared; help is the default goal; paths are anchored via $(abspath $(dir $(lastword $(MAKEFILE_LIST)))); deploy depends on init; the deploy recipe invokes scripts.rollout; each smoke knob is forwarded only when set; the project name and compose file overrides reach the rollout via environment variables; the PYTHON override wins on hosts with non-standard interpreters; the deploy recipe never names a destructive compose verb; make up is the bootstrap (no app_green mention); and make clean never shells out to docker. The whole suite still finishes in under three seconds because nothing in the new file opens a socket, calls docker, or spawns a real container: make -n does the heavy lifting, and every assertion is a string match on the printed recipe or on the Makefile source.

What we got

A single, stable entry point — make deploy — that any CI workflow file or SSH operator can call to drive the full five-step rollout end-to-end. The Makefile anchors every path to its own directory, so ssh host make -C /opt/svc deploy works from any login shell. It declares help as the default goal, so a bare make over SSH prints a usage banner instead of silently rolling out. It declares every recipe .PHONY, so a stray file named deploy in the working tree can never short-circuit a release. It forwards each of the five smoke knobs from step 5 only when an override is actually present, so the rollout script's argparse contract is never violated by an empty flag. It exposes a PYTHON override for hosts that ship a versioned interpreter outside the default PATH, a PROJECT_NAME override for parallel deploys, and a COMPOSE_FILE override for CI jobs that build their compose file at runtime. And every property that matters — depends-on-init, no destructive compose verbs, no TTY-requiring flags, no docker call inside clean — is pinned by a dry-run test that runs in milliseconds and needs no Docker daemon. The five-step zero-downtime story is now packaged: one verb, one entry point, identical behaviour from CI and from an operator's terminal.

Repository

The companion code for this article: https://github.com/vytharion/docker-rollout-zero-downtime-compose

The state of the code after this step: 1f06cda

Key commits to step through:

  • 3c9de0a — step 1: bootstrap a baseline docker-compose stack with app and reverse proxy
  • f69b280 — step 2: add healthcheck and readiness probe to the app service
  • 0084713 — step 3: write a blue-green rollout script that spins up the green container alongside blue
  • 287d5fa — step 4: switch traffic atomically at the reverse proxy from blue to green
  • ca5bbdd — step 5: add post-cutover smoke tests with automatic rollback on failure
  • 1f06cda — step 6: package the flow into a make deploy target for CI and SSH reuse