Replicas & Autoscaling

Run multiple instances of a service, spread across regions, scale automatically on load, and roll out safely.


A replica is one running instance of your service (web/worker/cron). Replicas give you capacity, redundancy, and zero-downtime deploys. Forgeon can autoscale replicas up/down based on CPU, memory, RPS, or latency.

Start with min=2 replicas for production web traffic—one to serve, one to roll—then add autoscaling.

Where to manage replicas

  • Project → Services → [service] → Scaling
    • Manual scale: set Min/Max replicas
    • Autoscaling policy: choose signals (CPU, Memory, RPS, Latency, Custom)
    • Regions: select where replicas run

Manual scaling (always available)

  • Min replicas — the floor (kept running 24/7 in chosen regions)
  • Max replicas — the ceiling (autoscaler won’t exceed this)
  • Spread — choose one or more regions; traffic is steered to the nearest healthy replica

With min=1 you can’t get zero-downtime rollouts. Use min≥2 for production web services.

Autoscaling policies

Turn on Autoscale to add/remove replicas automatically:

  • CPU target — keep average CPU ~60% (good default for APIs)
  • Memory target — keep RSS under 75–80% (heap-heavy runtimes)
  • RPS target — bound requests per replica (e.g., 200/instance)
  • Latency target — keep p95 below SLO (e.g., 300 ms)
  • Custom metric — point to an app-emitted gauge (queue depth, concurrency)
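
For the custom-metric signal, each replica just needs to expose a numeric gauge the autoscaler can read. The sketch below assumes a Prometheus-style /metrics endpoint built with the prom-client package and a hypothetical getQueueDepth() helper; the metric name is an example, so use whatever name and exposition format your policy is configured to watch.

metrics.js (custom gauge sketch)
const express = require('express')
const client = require('prom-client')

const app = express()

// Gauge the autoscaling policy can target (example name, not required by Forgeon)
const queueDepth = new client.Gauge({ name: 'queue_depth', help: 'Jobs waiting to be processed' })

// Hypothetical helper — replace with a real count from your queue or broker
async function getQueueDepth() { return 0 }

app.get('/metrics', async (req, res) => {
  queueDepth.set(await getQueueDepth())                 // refresh the gauge on each scrape
  res.set('Content-Type', client.register.contentType)  // Prometheus text exposition format
  res.end(await client.register.metrics())
})

app.listen(process.env.PORT, '0.0.0.0')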

Controls

  • Cooldown — wait before scaling again (e.g., 60–120s)
  • Step size — add/remove N replicas per decision (e.g., +2/−1)
  • Max surge — extra replicas allowed during deploy (e.g., 25%)
  • Max unavailable — replicas allowed to be temporarily out (e.g., 0 for web)
sane defaults
  policy:  cpu_target=60%, cooldown=90s, step=+2/-1
  rollout: surge=25%, unavailable=0
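
To see how these knobs interact, here is an illustrative sketch (not Forgeon’s actual implementation) of a single scaling decision driven by an RPS target: derive a desired count from the signal, limit the change by step size, and clamp to the min/max bounds. Cooldown simply spaces out how often a decision like this runs.

autoscale-sketch.js (illustrative decision logic)
// Illustrative only — not Forgeon's autoscaler
function nextReplicaCount({ currentReplicas, currentRps, rpsTarget, min, max, stepUp, stepDown }) {
  const desired = Math.ceil(currentRps / rpsTarget)                                     // e.g. 900 RPS / 200 per replica = 5
  let next = desired
  if (desired > currentReplicas) next = Math.min(desired, currentReplicas + stepUp)     // scale out by at most stepUp
  if (desired < currentReplicas) next = Math.max(desired, currentReplicas - stepDown)   // scale in by at most stepDown
  return Math.min(max, Math.max(min, next))                                             // never leave [min, max]
}

// Example: 900 RPS on 3 replicas, target 200/replica, step +2/-1 → desired 5, within step limit → scale to 5
console.log(nextReplicaCount({ currentReplicas: 3, currentRps: 900, rpsTarget: 200, min: 2, max: 10, stepUp: 2, stepDown: 1 }))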

Regional replicas

  • Pick multiple regions for availability and latency.
  • Forgeon routes users to the nearest healthy replica; failover happens automatically.
  • Databases: prefer a primary region close to your DB, or use read-replicas if your engine supports it.

Start single-region near your database. Add a second region once p95 latency or regional reliability becomes a concern.
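
If your database engine supports read replicas, a common pattern is to keep writes on the primary and send latency-tolerant reads to a nearby replica. A minimal sketch with node-postgres, assuming hypothetical DATABASE_URL_PRIMARY and DATABASE_URL_REPLICA environment variables:

db.js (read/write split sketch, hypothetical env vars)
const { Pool } = require('pg')

// Hypothetical connection strings — name them however your setup does
const primary = new Pool({ connectionString: process.env.DATABASE_URL_PRIMARY })
const replica = new Pool({ connectionString: process.env.DATABASE_URL_REPLICA })

// Writes always go to the primary; reads can tolerate the replica's slight lag
const writeUser = (id, name) =>
  primary.query('INSERT INTO users (id, name) VALUES ($1, $2)', [id, name])

const readUser = (id) =>
  replica.query('SELECT * FROM users WHERE id = $1', [id]).then((r) => r.rows[0])

module.exports = { writeUser, readUser }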

Zero-downtime deploys (rolling)

When you deploy a new revision:

  1. Surge up (optionally) to keep capacity
  2. New replicas must pass readiness (e.g., /readyz → 200)
  3. Traffic shifts to new replicas
  4. Old replicas drain and stop
readiness contract
  /readyz → 200 only when the DB/cache are ready and migrations have run
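
A minimal sketch of that contract in Node, assuming an Express app and hypothetical dbReady(), cacheReady(), and migrationsDone() helpers that you would back with your real clients:

readyz.js (readiness sketch, hypothetical helpers)
const express = require('express')
const app = express()

// Hypothetical checks — replace with real probes against your own clients
async function dbReady()        { /* e.g. SELECT 1 against the pool */ return true }
async function cacheReady()     { /* e.g. PING against Redis */ return true }
async function migrationsDone() { /* e.g. compare the schema version table */ return true }

app.get('/readyz', async (req, res) => {
  try {
    const ok = (await dbReady()) && (await cacheReady()) && (await migrationsDone())
    res.sendStatus(ok ? 200 : 503)   // 200 only when every dependency is ready
  } catch {
    res.sendStatus(503)              // a throwing check counts as not ready
  }
})

app.get('/healthz', (req, res) => res.sendStatus(200))   // liveness: the process is up

app.listen(process.env.PORT, '0.0.0.0')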

Sticky sessions & websockets

  • Sticky sessions (optional) pin a user to one replica via a cookie. Use them if you keep session state in memory.
  • WebSockets and Server-Sent Events are supported; use graceful shutdown so connections drain before a replica stops (see the sketch below).
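
For WebSockets specifically, draining means telling connected clients the replica is going away so they can reconnect to another one. A sketch using the ws package; the SIGTERM handling mirrors the graceful-shutdown example further down:

ws-drain.js (WebSocket drain sketch)
const http = require('http')
const { WebSocketServer } = require('ws')

const server = http.createServer()
const wss = new WebSocketServer({ server })

wss.on('connection', (socket) => {
  socket.on('message', (msg) => socket.send(msg))   // trivial echo handler
})

server.listen(process.env.PORT, '0.0.0.0')

process.on('SIGTERM', () => {
  // Ask every client to reconnect (1001 = "going away"), then stop the server
  for (const socket of wss.clients) socket.close(1001, 'replica shutting down')
  server.close(() => process.exit(0))
  setTimeout(() => process.exit(1), 15000)          // fallback if clients are slow to close
})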

Graceful shutdowns (don’t cut users off)

  • Handle SIGTERM to start draining, close listeners, and finish in-flight work.
  • Honor a termination grace period (set on the service) so the platform doesn’t force-kill long requests.
  • Keep /healthz returning OK until the process actually exits; flip /readyz to failing as soon as you start draining.
main.js — Node example
const express = require('express')   // assumes an Express app; adapt to your framework
const app = express()

let ready = true
app.get('/readyz', (req, res) => res.sendStatus(ready ? 200 : 503))

const server = app.listen(process.env.PORT, '0.0.0.0')

process.on('SIGTERM', () => {
  ready = false                              // /readyz flips to 503; the platform stops routing new traffic here
  server.close(() => process.exit(0))        // stop accepting connections; exit once in-flight requests finish
  setTimeout(() => process.exit(1), 15000)   // fallback if draining exceeds the termination grace period
})