Replicas & Autoscaling

Run multiple instances of a service, spread across regions, scale automatically on load, and roll out safely.


A replica is one running instance of your service (web/worker/cron). Replicas give you capacity, redundancy, and zero-downtime deploys. Forgeon can autoscale replicas up/down based on CPU, memory, RPS, or latency.

Start with min=2 replicas for production web traffic—one to serve, one to roll—then add autoscaling.

Where to manage replicas

  • Project → Services → [service] → Scaling
    • Manual scale: set Min/Max replicas
    • Autoscaling policy: choose signals (CPU, Memory, RPS, Latency, Custom)
    • Regions: select where replicas run

Manual scaling (always available)

  • Min replicas — the floor (kept running 24/7 in chosen regions)
  • Max replicas — the ceiling (autoscaler won’t exceed this)
  • Spread — choose one or more regions; traffic is steered to the nearest healthy replica

With min=1 you can’t get zero-downtime rollouts. Use min≥2 for production web services.

Autoscaling policies

Turn on Autoscale to add/remove replicas automatically:

  • CPU target — keep average CPU ~60% (good default for APIs)
  • Memory target — keep RSS under 75–80% (heap-heavy runtimes)
  • RPS target — bound requests per replica (e.g., 200/instance)
  • Latency target — keep p95 below SLO (e.g., 300 ms)
  • Custom metric — point to an app-emitted gauge (queue depth, concurrency)
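
For the custom-metric signal, each replica just needs to expose a numeric gauge the autoscaler can read. The sketch below assumes a Prometheus-style /metrics endpoint built with the prom-client package and a hypothetical getQueueDepth() helper; the metric name is an example, so use whatever name and exposition format your policy is configured to watch.

metrics.js (custom gauge sketch)
const express = require('express')
const client = require('prom-client')

const app = express()

// Gauge the autoscaling policy can target (example name, not required by Forgeon)
const queueDepth = new client.Gauge({ name: 'queue_depth', help: 'Jobs waiting to be processed' })

// Hypothetical helper — replace with a real count from your queue or broker
async function getQueueDepth() { return 0 }

app.get('/metrics', async (req, res) => {
  queueDepth.set(await getQueueDepth())                 // refresh the gauge on each scrape
  res.set('Content-Type', client.register.contentType)  // Prometheus text exposition format
  res.end(await client.register.metrics())
})

app.listen(process.env.PORT, '0.0.0.0')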

Controls

  • Cooldown — wait before scaling again (e.g., 60–120s)
  • Step size — add/remove N replicas per decision (e.g., +2/−1)
  • Max surge — extra replicas allowed during deploy (e.g., 25%)
  • Max unavailable — replicas allowed to be temporarily out (e.g., 0 for web)
sane defaults
  policy:  cpu_target=60%, cooldown=90s, step=+2/-1
  rollout: surge=25%, unavailable=0
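
To see how these knobs interact, here is an illustrative sketch (not Forgeon’s actual implementation) of a single scaling decision driven by an RPS target: derive a desired count from the signal, limit the change by step size, and clamp to the min/max bounds. Cooldown simply spaces out how often a decision like this runs.

autoscale-sketch.js (illustrative decision logic)
// Illustrative only — not Forgeon's autoscaler
function nextReplicaCount({ currentReplicas, currentRps, rpsTarget, min, max, stepUp, stepDown }) {
  const desired = Math.ceil(currentRps / rpsTarget)                                     // e.g. 900 RPS / 200 per replica = 5
  let next = desired
  if (desired > currentReplicas) next = Math.min(desired, currentReplicas + stepUp)     // scale out by at most stepUp
  if (desired < currentReplicas) next = Math.max(desired, currentReplicas - stepDown)   // scale in by at most stepDown
  return Math.min(max, Math.max(min, next))                                             // never leave [min, max]
}

// Example: 900 RPS on 3 replicas, target 200/replica, step +2/-1 → desired 5, within step limit → scale to 5
console.log(nextReplicaCount({ currentReplicas: 3, currentRps: 900, rpsTarget: 200, min: 2, max: 10, stepUp: 2, stepDown: 1 }))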

Regional replicas

  • Pick multiple regions for availability and latency.
  • Forgeon routes users to the nearest healthy replica; failover happens automatically.
  • Databases: prefer a primary region close to your DB, or use read-replicas if your engine supports it.

Start single-region near your database. Add a second region once p95 latency or regional reliability becomes a concern.
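
If your database engine supports read replicas, a common pattern is to keep writes on the primary and send latency-tolerant reads to a nearby replica. A minimal sketch with node-postgres, assuming hypothetical DATABASE_URL_PRIMARY and DATABASE_URL_REPLICA environment variables:

db.js (read/write split sketch, hypothetical env vars)
const { Pool } = require('pg')

// Hypothetical connection strings — name them however your setup does
const primary = new Pool({ connectionString: process.env.DATABASE_URL_PRIMARY })
const replica = new Pool({ connectionString: process.env.DATABASE_URL_REPLICA })

// Writes always go to the primary; reads can tolerate the replica's slight lag
const writeUser = (id, name) =>
  primary.query('INSERT INTO users (id, name) VALUES ($1, $2)', [id, name])

const readUser = (id) =>
  replica.query('SELECT * FROM users WHERE id = $1', [id]).then((r) => r.rows[0])

module.exports = { writeUser, readUser }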

Zero-downtime deploys (rolling)

When you deploy a new revision:

  1. Surge up (optionally) to keep capacity
  2. New replicas must pass readiness (e.g., /readyz → 200)
  3. Traffic shifts to new replicas
  4. Old replicas drain and stop
readiness contract
  /readyz → 200 only when the DB/cache are ready and migrations have run
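
A minimal sketch of that contract in Node, assuming an Express app and hypothetical dbReady(), cacheReady(), and migrationsDone() helpers that you would back with your real clients:

readyz.js (readiness sketch, hypothetical helpers)
const express = require('express')
const app = express()

// Hypothetical checks — replace with real probes against your own clients
async function dbReady()        { /* e.g. SELECT 1 against the pool */ return true }
async function cacheReady()     { /* e.g. PING against Redis */ return true }
async function migrationsDone() { /* e.g. compare the schema version table */ return true }

app.get('/readyz', async (req, res) => {
  try {
    const ok = (await dbReady()) && (await cacheReady()) && (await migrationsDone())
    res.sendStatus(ok ? 200 : 503)   // 200 only when every dependency is ready
  } catch {
    res.sendStatus(503)              // a throwing check counts as not ready
  }
})

app.get('/healthz', (req, res) => res.sendStatus(200))   // liveness: the process is up

app.listen(process.env.PORT, '0.0.0.0')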

Sticky sessions & websockets

  • Sticky sessions (optional) pin a user to one replica via a cookie. Use them if you keep session state in memory.
  • WebSockets and Server-Sent Events are supported; use graceful shutdown so connections drain before a replica stops (see the sketch below).
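
For WebSockets specifically, draining means telling connected clients the replica is going away so they can reconnect to another one. A sketch using the ws package; the SIGTERM handling mirrors the graceful-shutdown example further down:

ws-drain.js (WebSocket drain sketch)
const http = require('http')
const { WebSocketServer } = require('ws')

const server = http.createServer()
const wss = new WebSocketServer({ server })

wss.on('connection', (socket) => {
  socket.on('message', (msg) => socket.send(msg))   // trivial echo handler
})

server.listen(process.env.PORT, '0.0.0.0')

process.on('SIGTERM', () => {
  // Ask every client to reconnect (1001 = "going away"), then stop the server
  for (const socket of wss.clients) socket.close(1001, 'replica shutting down')
  server.close(() => process.exit(0))
  setTimeout(() => process.exit(1), 15000)          // fallback if clients are slow to close
})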

Graceful shutdowns (don’t cut users off)

  • Handle SIGTERM to start draining, close listeners, and finish in-flight work.
  • Honor a termination grace period (set on the service) so the platform doesn’t force-kill long requests.
  • Keep /healthz returning OK until the process actually exits; flip /readyz to failing as soon as you start draining.
main.js — Node example
const express = require('express')   // assumes an Express app; adapt to your framework
const app = express()

let ready = true
app.get('/readyz', (req, res) => res.sendStatus(ready ? 200 : 503))

const server = app.listen(process.env.PORT, '0.0.0.0')

process.on('SIGTERM', () => {
  ready = false                              // /readyz flips to 503; the platform stops routing new traffic here
  server.close(() => process.exit(0))        // stop accepting connections; exit once in-flight requests finish
  setTimeout(() => process.exit(1), 15000)   // fallback if draining exceeds the termination grace period
})