Zero-Downtime Deployments with Docker Compose — No Kubernetes Required
There's a mass delusion in the industry that you need Kubernetes to run a serious production service. You don't. At StatusDude, we serve thousands of monitoring checks per minute, run multi-region workers, and deploy multiple times a day — all with Docker Compose and HAProxy. Zero dropped requests. Zero downtime. No etcd to babysit at 3 AM.
But we didn't start with HAProxy. We started with Traefik. That lasted about four hours.
We Tried Traefik First
Traefik is the popular choice for Docker-based setups. It auto-discovers services via Docker labels, has a slick dashboard, and the docs make it look effortless. We set up two backend replicas with Traefik labels, ran a rolling deploy, and watched everything fall apart.
"Service defined multiple times"
Our first deploy strategy was to run a backend_new service alongside the existing backend during the transition. Both had the same Traefik routing labels — same Host rule, same service definition. Makes sense, right? You want both old and new to serve traffic during the cutover.
Traefik disagreed. Its Docker provider treats each Compose service as a separate configuration source. Two services with the same labels? "Service defined multiple times." 404 on every request. No fallback, no merge, just a flat refusal to route anything.
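For the record, the conflicting setup looked roughly like this (service names, router name, and host are illustrative, not our exact config):

```yaml
# Sketch of the broken approach: two Compose services carrying the SAME
# Traefik router/service labels. Traefik's Docker provider reads each
# container as a separate configuration source, flags the duplicate
# definition, and stops routing to either one.
services:
  backend:
    image: myapp-backend:current
    labels:
      - "traefik.http.routers.api.rule=Host(`example.com`)"
      - "traefik.http.services.api.loadbalancer.server.port=8000"
  backend_new:
    image: myapp-backend:next
    labels:
      # Same router name, same rule -> "Service defined multiple times"
      - "traefik.http.routers.api.rule=Host(`example.com`)"
      - "traefik.http.services.api.loadbalancer.server.port=8000"
```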
We reworked the approach to use docker compose up --scale backend=4 instead of a separate service. That avoided the label conflict. But it uncovered the next problem.
The Scale-Down Race
The rolling deploy strategy: scale up to 4 replicas (2 old + 2 new), then scale back down to 2 (keeping only the new ones). Simple enough.
Except Traefik's internal routing table didn't update fast enough. We'd scale down from 4 to 2, and Traefik would keep routing to containers that were in the process of shutting down. 502s on every other request. The routing state lagged behind Docker's reality by several seconds — long enough to drop a significant chunk of traffic.
We tried adding delays. We tried disconnecting containers from the network before stopping them (so the health check would fail cleanly before removal). We tried passive health checks — added them, then immediately rolled them back because they were too aggressive and caused false positives.
None of it was clean. But the real killer was something else entirely.
The Killer: No Retry on a Different Backend
This is a known issue that has been open for years, and that the maintainers seem content to leave unresolved: https://github.com/traefik/traefik/issues/2723
Here's the scenario: during a rolling deploy, you stop an old container. docker stop sends SIGTERM. Uvicorn starts its graceful shutdown, but there's a window — requests that are already in-flight, or requests that arrive between the stop signal and Traefik updating its routing table.
When that request hits the dying backend, the connection drops mid-stream. The client gets a raw error — empty response, connection reset, partial body.
Now here's what Traefik does with that failed request: nothing.
Traefik's retry middleware exists, but it retries on the same backend. The one that's dying. The one that will fail again. It doesn't redispatch to a healthy backend. The request is just... lost.
We tried every combination: passive health checks, disconnect-before-stop, retry middleware with different attempt counts. The fundamental problem remained — Traefik couldn't send a failed request to a different server.
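For completeness, this is the kind of retry middleware we were experimenting with (a sketch with illustrative names, not our exact labels). It re-attempts a failed request, but against the server the load balancer already picked:

```yaml
# Traefik's retry middleware, attached via labels. `attempts` controls how
# many times the SAME request is re-sent -- but always to the same backend,
# which is exactly what doesn't help during a rolling deploy.
services:
  backend:
    labels:
      - "traefik.http.middlewares.api-retry.retry.attempts=3"
      - "traefik.http.middlewares.api-retry.retry.initialinterval=100ms"
      - "traefik.http.routers.api.middlewares=api-retry"
```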
That afternoon, we ripped out Traefik and reached for HAProxy.
What You Actually Need
Let's strip it down. What does zero-downtime deployment actually require?
- Multiple backend instances — so you can replace one while the other serves traffic
- A load balancer that retries on a different backend — so dying containers don't drop requests
- A deploy script that replaces instances one at a time — rolling update
That's it. Three things. Let me show you how we do each one.
Step 1: Multiple Replicas with Docker Compose
Docker Compose has a built-in deploy.replicas setting:
```yaml
# docker-compose.yml
services:
  backend:
    build: ./backend
    deploy:
      replicas: 2
    image: myapp-backend
    expose:
      - "8000"
    env_file: .env
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 5s
      timeout: 5s
      retries: 3
      start_period: 5s
    restart: unless-stopped
```
That's 2 backend containers running behind a shared Docker DNS name backend. When you resolve backend inside the Docker network, you get both container IPs.
One Dockerfile, one image, two containers. No pod specs, no deployments, no replica sets.
Step 2: HAProxy as the Load Balancer
HAProxy is battle-tested, fast, and the configuration is readable. But the real reason we chose it: option redispatch.
```
global
    log stdout format raw local0 info
    maxconn 4096

defaults
    mode http
    timeout connect 3s
    timeout client 30s
    timeout server 30s
    # THE key feature: retry failed requests on a DIFFERENT backend
    retries 3
    option redispatch 1
    retry-on conn-failure empty-response response-timeout 502 503 504

resolvers docker_dns
    nameserver dns1 127.0.0.11:53
    resolve_retries 3
    timeout resolve 1s
    timeout retry 1s
    # Re-resolve DNS every 2 seconds
    hold valid 2s

frontend http_in
    bind *:80
    default_backend backends

backend backends
    balance roundrobin
    option httpchk
    http-check send meth GET uri /health
    http-check expect status 200
    default-server inter 1s fall 1 rise 1 check resolvers docker_dns \
        resolve-prefer ipv4 init-addr none \
        observe layer7 error-limit 3 on-error mark-down
    server-template backend 1-10 backend:8000 check
```
Let's talk about the three things that make this work.
Retry on a Different Backend
This is the feature that Traefik couldn't deliver:
```
retries 3
option redispatch 1
retry-on conn-failure empty-response response-timeout 502 503 504
```
When a request fails — connection refused, empty response, 502, 503, 504 — HAProxy retries it. And option redispatch 1 means every retry goes to a different backend. Not the same dying server. A different, healthy one.
During a rolling deploy, if a request hits a container that's shutting down and gets an empty response, HAProxy silently retries on the other replica. The client never sees the error. No dropped requests. This single feature eliminated every problem we had with Traefik.
Three Layers of Health Detection
We don't rely on a single health check mechanism. There are three independent layers, each catching different failure modes:
Layer 1 — Per-request retry (milliseconds): If a single request fails, retry immediately on a different backend. Catches transient failures during deploys.
Layer 2 — Passive observation (observe layer7): HAProxy watches actual HTTP responses from real traffic. If a backend returns 3 consecutive 5xx errors (error-limit 3), it's pulled from rotation instantly (on-error mark-down). No waiting for any probe cycle.
Layer 3 — Active health checks (inter 1s fall 1 rise 1): Probes the /health endpoint every second. Catches completely dead backends that receive no traffic. One failure = instant DOWN. One success = back in rotation.
Each layer covers a blind spot of the others. Per-request retry handles the single request that hits a dying backend. Passive checks handle backends that start returning errors under load. Active checks handle backends that crash silently with no traffic flowing to them.
DNS-Based Discovery (No Docker Socket)
The server-template backend 1-10 backend:8000 check line is how HAProxy discovers backends. It resolves the Docker DNS name backend using Docker's embedded DNS resolver (127.0.0.11:53) and creates server entries for each IP it finds.
The hold valid 2s means HAProxy re-resolves every 2 seconds. Container dies? Its IP disappears from DNS. New container starts? Its IP appears. HAProxy picks it up automatically.
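If you want to watch this happen, one option (not part of the config above, so treat it as an optional extra) is to enable HAProxy's runtime stats socket and query server state while a deploy runs:

```
global
    # Optional addition: expose the runtime API on a unix socket
    stats socket /var/run/haproxy.sock mode 600 level admin

# Then, from inside the HAProxy container during a deploy:
#   echo "show servers state backends" | socat stdio /var/run/haproxy.sock
# With init-addr none, unresolved template slots sit in MAINT; you'll see
# entries flip between UP, DOWN and MAINT as the DNS answers change.
```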
No Docker socket mount. No label parsing. No dynamic config generation. A static config file that just works. No service mesh. No sidecar. No operator.
Srsly.
Step 3: The Rolling Deploy
This is the entire deploy script:
```makefile
prod-deploy:
	@echo "=== Zero-downtime rolling deploy ==="
	@for cid in $$(docker compose -f docker-compose.prod.yml ps -q backend); do \
		echo "Replacing $$cid..."; \
		docker stop $$cid && docker rm -f $$cid; \
		docker compose -f docker-compose.prod.yml up -d --no-deps --no-recreate --wait backend; \
	done
	@echo "=== Deploy complete ==="
```
@echo "=== Deploy complete ==="
That's it. Let me walk through what happens:
- Get the container IDs of all running backend replicas
- For each replica, one at a time:
  - Stop and remove the container
  - HAProxy detects the missing backend within 2 seconds (DNS re-resolution)
  - Traffic shifts to the remaining healthy replica
  - Start a new container with the updated image
  - --wait blocks until Docker's healthcheck passes
  - HAProxy discovers the new backend via DNS
  - Traffic starts flowing to the new container
  - Move to the next replica
At every point during the deploy, at least one healthy backend is serving traffic. The --no-recreate flag prevents Docker from touching the replica we haven't replaced yet.
Any requests that hit the dying container during that 2-second DNS window? Retried on the healthy replica automatically. The client never knows.
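Don't take my word for it: hammer the frontend while a deploy runs and count the failures yourself. A minimal smoke-test sketch (the URL, request count, and env var names are made up for illustration; point it at your own HAProxy frontend):

```shell
#!/bin/sh
# Hit the endpoint repeatedly and count non-200 responses. Run this in one
# terminal while `make prod-deploy` runs in another; the failure count
# should stay at zero if retries and redispatch are doing their job.
URL="${SMOKE_URL:-http://localhost/health}"   # hypothetical frontend URL
N="${SMOKE_N:-50}"                            # number of requests to send
fails=0
i=0
while [ "$i" -lt "$N" ]; do
  # --max-time bounds each request; any curl failure counts as a non-200
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 2 "$URL" 2>/dev/null) || code=000
  [ "$code" = "200" ] || fails=$((fails + 1))
  i=$((i + 1))
done
echo "non-200 responses: $fails/$N"
```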
Our Setup in Numbers
At StatusDude, this setup handles:
- Thousands of monitoring checks per minute across 3 regions
- Multiple deploys per day with zero dropped requests
- Sub-2-second failover when a backend goes down
- ~60 lines of HAProxy config and a 10-line deploy script
We went from Traefik (404s, 502s, dropped requests, four hours of debugging) to HAProxy (zero dropped requests, first deploy) in one afternoon. Sometimes the boring, battle-tested tool is the right choice. Well, OK, not "sometimes" - quite often ;-)
P.S.
Nginx would do too, I just felt like getting HAProxy up this time :)